YOLOv5 Inference Pipelines

DeepSparse allows accelerated inference, serving, and benchmarking of sparsified Ultralytics YOLOv5 models.
This integration uses the DeepSparse Engine to run sparsified YOLOv5 inference with GPU-class performance directly on the CPU.

The DeepSparse Engine takes advantage of sparsity within neural networks to reduce the compute required and to accelerate memory-bound workloads. The engine is particularly effective when combined with sparsification methods such as pruning and quantization. These techniques result in significantly faster and smaller models with limited to no effect on the baseline metrics.

This integration currently supports the original YOLOv5 and updated V6.1 architectures.

Getting Started

Before you start your adventure with the DeepSparse Engine, make sure that your machine is compatible with our hardware requirements.


pip install deepsparse[yolo]

Model Format

By default, deploying YOLOv5 with the DeepSparse Engine requires the model to be supplied in the ONNX format. This gives the engine the flexibility to serve any model in a framework-agnostic environment.

Below we describe two possibilities to obtain the required ONNX model.

Exporting the ONNX File From the Contents of a Local Directory

This pathway is relevant if you intend to deploy a model created using the SparseML library. For more information, refer to the appropriate YOLOv5 integration documentation in SparseML.

After training your model with SparseML, locate the .pt file for the model you'd like to export and run the SparseML integrated YOLOv5 ONNX export script below.

sparseml.yolov5.export_onnx \
    --weights path/to/your/model \
    --dynamic #Allows for dynamic input shape

This creates a DeepSparse_Deployment folder with a model.onnx file (e.g. runs/train/exp/DeepSparse_Deployment/model.onnx).
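If several training runs have been exported, it can help to list the resulting files before picking one to deploy. The following is a minimal sketch using only the Python standard library; the helper name `find_exported_models` and the `runs/train` search root are assumptions for illustration and are not part of SparseML or DeepSparse.

```python
from pathlib import Path
from typing import List


def find_exported_models(runs_dir: str) -> List[Path]:
    """Return every DeepSparse_Deployment/model.onnx under runs_dir, newest first.

    Hypothetical helper: it only walks the file system and does not
    validate that the files are well-formed ONNX models.
    """
    root = Path(runs_dir)
    if not root.is_dir():
        return []
    matches = root.glob("**/DeepSparse_Deployment/model.onnx")
    # Sort by modification time so the most recent export comes first.
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # "runs/train" mirrors the example path above; adjust it to your layout.
    for path in find_exported_models("runs/train"):
        print(path)
```

Any path printed this way can then be used wherever a local ONNX model path is expected.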

SparseZoo Stub

Alternatively, you can skip the ONNX export step by using Neural Magic's SparseZoo. The SparseZoo contains pre-sparsified models, and SparseZoo stubs let you reference any model on the SparseZoo in a convenient and predictable way. All of DeepSparse's pipelines and APIs can use a SparseZoo stub in place of a local folder. The Deployment API examples use SparseZoo stubs to highlight this pathway.
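The stubs used in this document all share the same shape: a `zoo:` prefix followed by seven slash-separated segments. The sketch below splits a stub into those segments for inspection. The field names (`domain`, `sub_domain`, and so on) are labels assumed for this illustration, not official SparseZoo terminology, and `parse_stub` is a hypothetical helper rather than a SparseZoo API.

```python
from typing import Dict

# Assumed, illustrative labels for the seven path segments of a stub.
STUB_FIELDS = (
    "domain",
    "sub_domain",
    "architecture",
    "framework",
    "repo",
    "dataset",
    "variant",
)


def parse_stub(stub: str) -> Dict[str, str]:
    """Split a 'zoo:' stub into labeled segments (hypothetical helper)."""
    prefix = "zoo:"
    if not stub.startswith(prefix):
        raise ValueError(f"expected a stub starting with {prefix!r}: {stub!r}")
    parts = stub[len(prefix):].split("/")
    if len(parts) != len(STUB_FIELDS):
        raise ValueError(f"expected {len(STUB_FIELDS)} segments, got {len(parts)}")
    return dict(zip(STUB_FIELDS, parts))


if __name__ == "__main__":
    stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned-aggressive_96"
    for name, value in parse_stub(stub).items():
        print(f"{name:>12}: {value}")
```

A check like this is only a convenience for reading stubs; DeepSparse itself accepts the stub string directly.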

Deployment APIs

DeepSparse provides both a Python Pipeline API and an out-of-the-box model server that can be used for end-to-end inference in either existing Python workflows or as an HTTP endpoint. Both options provide similar specifications for configurations and support annotation serving for all YOLOv5 models.

Python Pipelines

Pipelines are the default interface for running inference with the DeepSparse Engine.

Once a model is obtained, either through SparseML training or directly from SparseZoo, deepsparse.Pipeline can be used for end-to-end inference and deployment of the sparsified neural network.

If no model is specified to the Pipeline for a given task, the Pipeline will automatically select a pruned and quantized model for the task from the SparseZoo that can be used for accelerated inference. Note that other models in the SparseZoo will have different tradeoffs between speed, size, and accuracy.

DeepSparse Server

As an alternative to the Python API, the DeepSparse Server allows you to serve ONNX models and pipelines over HTTP. Both configuring the server and making requests to it follow the same parameters and schemas as the Pipelines, which keeps deployment simple. Once launched, a /docs endpoint is created with full endpoint descriptions and support for making sample requests.

An example of starting and requesting a DeepSparse Server for YOLOv5 is given below.


The DeepSparse Server requirements can be installed by specifying the server extra dependency when installing DeepSparse.

pip install deepsparse[yolo,server]

Deployment Example

The following example uses pipelines to run a pruned and quantized YOLOv5l model for inference, downloaded by default from the SparseZoo. As input the pipeline ingests a list of images and returns for each image the detection boxes in numeric form.

List of the YOLOv5 SparseZoo Models

If you don't have an image ready, pull a sample image down with

wget -O basilica.jpg

from deepsparse import Pipeline

model_stub = "zoo:cv/detection/yolov5-l/pytorch/ultralytics/coco/pruned-aggressive_98"
images = ["basilica.jpg"]

# Create a YOLO pipeline backed by the SparseZoo model referenced above
yolo_pipeline = Pipeline.create(
    task="yolo",
    model_path=model_stub,
)

pipeline_outputs = yolo_pipeline(images=images, iou_thres=0.6, conf_thres=0.001)
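The `iou_thres` argument is not explained in this README; the name suggests an intersection-over-union (IoU) threshold of the kind used in non-maximum suppression, where overlapping detections above the threshold are treated as duplicates. As a rough illustration of what such a threshold compares against, the sketch below computes IoU for two boxes. The `[x1, y1, x2, y2]` corner format and the function name `iou` are assumptions for this example and are not taken from the DeepSparse API.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection over union of two boxes given as [x1, y1, x2, y2] (assumed format)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    overlap = iou([0, 0, 10, 10], [5, 0, 15, 10])
    # With a threshold of 0.6, these two boxes would not count as duplicates.
    print(f"IoU = {overlap:.3f}, above 0.6 threshold: {overlap > 0.6}")
```

In this example the boxes share half of each box's area, which gives an IoU of one third, well below the 0.6 used in the call above.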

Annotate CLI

You can also use the annotate command to have the engine save an annotated photo on disk.

deepsparse.object_detection.annotate --source basilica.jpg #Try --source 0 to annotate your live webcam feed

Running the above command will create an annotation-results folder and save the annotated image inside.


Image annotated with 96% sparse YOLOv5s

If a --model_filepath arg isn't provided, then zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned-aggressive_96 will be used by default.

HTTP Server

Spinning up:

deepsparse.server \
    --task yolo \
    --model_path "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94"

Making a request:

import requests
import json

url = ''  # URL of the running server's prediction endpoint
path = ['basilica.jpg'] # list of images for inference
files = [('request', open(img, 'rb')) for img in path]
resp = requests.post(url=url, files=files)
annotations = json.loads(resp.text) # dictionary of annotation results
bounding_boxes = annotations["boxes"]
labels = annotations["labels"]
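Once the response has been decoded, a quick summary of what was detected is often more useful than the raw lists. The sketch below counts labels per image. It assumes that `annotations["labels"]` holds one list of labels per input image, which matches the keys used above but is not confirmed by this README; the `sample` payload is made-up data for illustration, and `count_labels` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, List


def count_labels(annotations: Dict[str, Any]) -> List[Counter]:
    """Return one Counter of label occurrences per image (assumed response layout)."""
    return [
        Counter(str(label) for label in image_labels)
        for image_labels in annotations["labels"]
    ]


# Made-up payload standing in for a decoded server response with one image.
sample = {
    "boxes": [
        [
            [10.0, 20.0, 110.0, 220.0],
            [30.0, 40.0, 90.0, 120.0],
            [200.0, 50.0, 260.0, 140.0],
        ]
    ],
    "labels": [["0", "0", "2"]],
}

for index, counts in enumerate(count_labels(sample)):
    print(f"image {index}: {dict(counts)}")
```

With a real response, `sample` would be replaced by the `annotations` dictionary from the request above.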


Benchmarking

The mission of Neural Magic is to enable GPU-class inference performance on commodity CPUs. Want to find out how fast our sparse YOLOv5 ONNX models perform inference? You can quickly do benchmarking tests on your own with a single CLI command!

You only need to provide the model path of a SparseZoo ONNX model or your own local ONNX model to get started:

deepsparse.benchmark \
    zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94 \
    --scenario sync 

>> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94
>> Batch Size: 1
>> Scenario: sync
>> Throughput (items/sec): 74.0355
>> Latency Mean (ms/batch): 13.4924
>> Latency Median (ms/batch): 13.4177
>> Latency Std (ms/batch): 0.2166
>> Iterations: 741
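The reported numbers can be sanity-checked against each other. In a synchronous, one-request-at-a-time run, throughput should be roughly the batch size divided by the mean latency. The sketch below parses the `>>` lines shown above and performs that check; `parse_benchmark_output` is a hypothetical helper written for this example, not part of the `deepsparse.benchmark` tooling, and the relationship is an approximation rather than a documented guarantee.

```python
from typing import Dict, Union

BENCHMARK_OUTPUT = """\
>> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94
>> Batch Size: 1
>> Scenario: sync
>> Throughput (items/sec): 74.0355
>> Latency Mean (ms/batch): 13.4924
>> Latency Median (ms/batch): 13.4177
>> Latency Std (ms/batch): 0.2166
>> Iterations: 741
"""


def parse_benchmark_output(text: str) -> Dict[str, Union[float, str]]:
    """Turn '>> Key: value' lines into a dict, converting numeric values to float."""
    metrics: Dict[str, Union[float, str]] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.startswith(">>"):
            continue
        # Split on the first colon only, since model paths contain colons too.
        key, sep, value = line[2:].partition(":")
        if not sep:
            continue
        value = value.strip()
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            metrics[key.strip()] = value
    return metrics


metrics = parse_benchmark_output(BENCHMARK_OUTPUT)
# Items per second is approximately 1000 ms * batch size / mean latency in ms.
estimated = 1000.0 * metrics["Batch Size"] / metrics["Latency Mean (ms/batch)"]
print(f"reported:  {metrics['Throughput (items/sec)']:.2f} items/sec")
print(f"estimated: {estimated:.2f} items/sec")
```

For the run above, the estimate lands within about 0.1 items/sec of the reported throughput; the small gap is plausibly per-iteration overhead outside the timed inference call.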

To learn more about benchmarking, refer to the appropriate documentation. Also, check out our Benchmarking tutorial!


For a deeper dive into using YOLOv5 within the Neural Magic ecosystem, refer to the detailed tutorials on our website.


For Neural Magic Support, sign up or log in to our Deep Sparse Community Slack. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue.