A sparsity-aware inference runtime that delivers GPU-class performance on commodity CPUs, purely in software, anywhere.

illustration of neural magic identifying objects in an image

NLP Image

GPUs Are Not Optimal

Machine learning inference has evolved over the years led by GPU advancements. GPUs are fast and powerful, but they can be expensive, have short life spans, and require a lot of electricity.

CPUs Alone Don't Meet the Bar

CPUs are flexible in deployment and more commonly available. But they are generally discounted in the world of ML, due to slow performance as models grow larger.

What if you could have the best of both?

Meet DeepSparse

Accelerate Deep Learning on CPUs

DeepSparse achieves its performance using breakthrough algorithms that reduce the computation needed for neural network execution and accelerate the resulting memory-bound computation.

On commodity CPUs, the DeepSparse architecture is designed to emulate the way the human brain computes:

  1. It uses sparsity to reduce the number of floating-point operations.
  2. It uses the CPU’s large fast caches to provide locality of reference, executing the network depth-wise and asynchronously.



Run Anywhere
Have the freedom to run inference workloads anywhere, from on-premise, to the edge, to the cloud.

SOTA Performance
Deploy and manage DeepSparse with the same systems and processes as your typical CPU-based application workloads.

Reduce Costs
Lower hardware expenses by running on standard CPUs.

Product Overview

GPU-Class Performance
Order-of-magnitude throughput and latency gains over CPU-based alternatives for CNNs and Transformers.
Python and C++ APIs
Available as a PyPI package or C++ binary for easy integration to your application of choice.
DeepSparse Pipelines
Pre-made inference pipelines for common CV and NLP tasks.
DeepSparse Server
Integrated inference server for easy creation of REST endpoints running DeepSparse.
Flexible Inference Modes
Achieve maximum latency with single-stream inference or process inferences concurrently with the NUMA-aware engine to hit latency goals.
Many Core Scaling
Scale across large heterogeneous systems (including multi-socket) to fully utilize CPU hardware.
x86 and ARM Support
Deploy anywhere a CPU exists - in the cloud, at the edge, and in the data center across Intel, AMD, and ARM.
Low Compute, Edge Optimized
Reduce compute and size of your models with sparsification to truly deploy anywhere.

Start inferencing with DeepSparse on your own infrastructure today.