A sparsity-aware inference server for CPUs. Get more from your CPU infrastructure by using DeepSparse to run computer vision (CV), natural language processing (NLP), and large language model (LLM) workloads with high performance.


Can CPUs Be Enough to Support AI Inferencing?

GPUs are not needed for all deep learning use cases.

While GPUs are fast and powerful, they can be overkill in certain inference scenarios, and depending on them keeps organizations from using existing CPU hardware for deep learning workloads.

CPUs alone don’t meet the bar for deep learning.

CPUs are flexible to deploy and more widely available, but they are generally discounted in deep learning because their performance slows as models grow larger.


How It Works

Accelerate Deep Learning on CPUs

DeepSparse achieves its performance using breakthrough algorithms that reduce the computation needed for neural network execution and accelerate the resulting memory-bound computation.

On commodity CPUs, the DeepSparse architecture is designed to emulate the way the human brain computes. It uses:

  1. Sparsity to reduce the number of floating-point operations.
  2. The CPU’s large, fast caches to provide locality of reference, executing the network depth-wise and asynchronously.
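The first point, that sparsity reduces the number of floating-point operations, can be illustrated with a small sketch. This is not DeepSparse code: the function names, the 64x64 layer size, and the 90% sparsity level are all assumptions chosen for illustration. It counts multiply-adds for a matrix-vector product computed densely and then with the zero weights skipped.

```python
import random

def dense_matvec(weights, x):
    """Multiply a dense matrix by a vector, counting every multiply-add."""
    flops = 0
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi
            flops += 1
        out.append(acc)
    return out, flops

def to_sparse_rows(weights):
    """Keep only the nonzero weights of each row as (column, value) pairs."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in weights]

def sparse_matvec(sparse_rows, x):
    """Same product, but zero weights are never touched."""
    flops = 0
    out = []
    for row in sparse_rows:
        acc = 0.0
        for j, w in row:
            acc += w * x[j]
            flops += 1
        out.append(acc)
    return out, flops

# Hypothetical layer: 64x64 weights with roughly 90% of entries set to zero.
random.seed(0)
rows, cols, sparsity = 64, 64, 0.9
weights = [
    [0.0 if random.random() < sparsity else random.uniform(-1, 1) for _ in range(cols)]
    for _ in range(rows)
]
x = [random.uniform(-1, 1) for _ in range(cols)]

dense_out, dense_flops = dense_matvec(weights, x)
sparse_out, sparse_flops = sparse_matvec(to_sparse_rows(weights), x)

print(f"dense multiply-adds:  {dense_flops}")
print(f"sparse multiply-adds: {sparse_flops}")
```

Both versions produce the same output vector; the sparse version simply does work proportional to the number of nonzero weights. Turning that reduced operation count into real speed depends on memory access patterns, which is where the second point about caches comes in.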

Product Overview

Wide Range of Features

Python and C++ APIs

Available as a PyPI package or C++ binary for easy integration into your application of choice.

DeepSparse Server

Integrated inference server for easy creation of REST endpoints running DeepSparse.

Flexible Inference Modes

Achieve minimum latency with single-stream inference, or process inferences concurrently with the NUMA-aware engine to hit throughput goals.
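The difference between the two modes can be sketched with the Python standard library alone. Nothing here is the DeepSparse API: `fake_infer` is a hypothetical stand-in that only sleeps, and the request count, service time, and worker count are assumed values. The sketch compares handling requests one at a time with handling several at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_infer(request_id, service_time=0.02):
    """Hypothetical stand-in for one inference call; it only sleeps."""
    time.sleep(service_time)
    return request_id

def run_single_stream(requests):
    """Handle one request at a time: each request gets the whole engine."""
    start = time.perf_counter()
    results = [fake_infer(r) for r in requests]
    return results, time.perf_counter() - start

def run_concurrent(requests, workers=4):
    """Handle several requests at once to raise total requests per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_infer, requests))
    return results, time.perf_counter() - start

requests = list(range(8))
seq_results, seq_time = run_single_stream(requests)
con_results, con_time = run_concurrent(requests)

print(f"single-stream: {len(requests) / seq_time:.1f} req/s")
print(f"concurrent:    {len(requests) / con_time:.1f} req/s")
```

Because the stand-in sleeps rather than computes, the overlap shown here is idealized. Real inference requests compete for cores and memory bandwidth, so per-request latency typically rises as concurrency increases, which is why the choice of mode depends on whether latency or throughput matters more.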

Many Core Scaling

Scale across large heterogeneous systems (including multi-socket) to fully utilize CPU hardware.

x86 and ARM Support

Deploy anywhere a CPU exists: in the cloud, at the edge, and in the data center, across Intel, AMD, and ARM.

Low Compute, Edge Optimized

Reduce the compute requirements and size of your models with sparsification to deploy truly anywhere.
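As a rough illustration of how sparsification shrinks a model, the sketch below applies magnitude pruning, one common sparsification technique. The page does not say which method is used, so the technique, the `magnitude_prune` helper, the example weights, and the 50% sparsity target are all assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Hypothetical weights for illustration only.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.1, 0.9]
pruned = magnitude_prune(weights, sparsity=0.5)

# Only nonzero weights need to be stored, as (index, value) pairs.
nonzero = [(i, w) for i, w in enumerate(pruned) if w != 0.0]

print(pruned)
print(f"stored values: {len(nonzero)} of {len(weights)}")
```

In practice, pruning is usually combined with retraining or fine-tuning to recover accuracy, and the storage saving depends on the overhead of the sparse format's index data.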