A sparsity-aware inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application.

pip install deepsparse

deepsparse.benchmark "zoo:nlp/question_answering/distilbert-none/pytorch/huggingface/squad/pruned80_quant-none-vnni"

GPUs Are Not Optimal

Machine learning inference has evolved over the years, driven largely by GPU advancements. GPUs are fast and powerful, but they can be expensive, have short life spans, and require a lot of electricity.

CPUs Are Set for Failure

CPUs are flexible in deployment and more commonly available, but they have generally been discounted for ML workloads: the way current models are designed does not suit the CPU architecture.

What if you could have the best of both worlds?


Machine Learning Execution Reimagined

DeepSparse achieves its performance using breakthrough algorithms that reduce the computation needed for neural network execution and accelerate the resulting memory-bound computation.

The DeepSparse architecture is designed to mimic, on commodity CPUs, the way brains compute:

  1. It uses sparsity to reduce the number of FLOPs required.
  2. It uses the CPU’s large, fast caches to provide locality of reference, executing the network depthwise and asynchronously.
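The first idea can be sketched in plain Python. This is an illustrative toy, not DeepSparse internals: a dot product that skips zero weights entirely, showing how a pruned (sparse) layer needs far fewer multiply-accumulates than its dense equivalent.

```python
# Illustrative sketch (not the DeepSparse implementation): a sparse dot
# product that skips zero weights, so FLOPs scale with the number of
# nonzero weights rather than the layer width.

def sparse_dot(weights, inputs):
    """Multiply-accumulate only over nonzero weights; also count FLOPs."""
    total = 0.0
    flops = 0
    for w, x in zip(weights, inputs):
        if w != 0.0:        # pruned weights contribute nothing, so skip them
            total += w * x
            flops += 1
    return total, flops

# A weight vector pruned to 70% sparsity: 7 of 10 weights are zero.
weights = [0.0, 0.5, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0, 2.0]
inputs = [1.0] * 10

result, flops_used = sparse_dot(weights, inputs)
print(result)      # 1.5
print(flops_used)  # 3 multiply-accumulates instead of 10
```

A real runtime does this with compressed weight storage and vectorized kernels rather than a per-element branch, but the arithmetic savings come from the same place: zeros are never touched.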

Learn more about the DeepSparse technology.

Run Anywhere

Deploy with flexibility on premise, in the cloud, or at the edge.

SOTA Performance

Run models on CPUs at GPU speeds. No special hardware required.

Reduce Costs

Use efficient model execution to save on compute expenses.


Proprietary Tensor Column Infrastructure
Leverage Tensor Columns to reduce data movement to achieve exceptional performance
Single-Stream Inference
Achieve maximum latency performance for a single inference pipeline
Multi-Socket Deployment
Scale across large heterogeneous systems to fully utilize CPU hardware
Low Compute, Edge-Optimized
Reduce compute and size of your models with sparsification to truly deploy anywhere
Deployment Flexibility
Deploy anywhere a CPU exists - in the cloud, at the edge, and in the datacenter
Connect Any Application
Run inference with the runtime at the level of integration you need: Python and C++ bindings available
Multi-Stream Inference
Schedule inferences concurrently with the NUMA-aware runtime to hit latency and throughput goals
Model Analysis
Analyze your model with layer-by-layer performance breakdowns
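The multi-stream idea above can be sketched with the standard library. This is a hypothetical stand-in, not the DeepSparse scheduler: a shared inference function served by a small worker pool, trading a little single-request latency for higher aggregate throughput.

```python
# Hypothetical sketch of multi-stream scheduling (not the DeepSparse
# NUMA-aware scheduler): a worker pool runs several inference requests
# concurrently against a shared, thread-safe engine stand-in.

from concurrent.futures import ThreadPoolExecutor


def infer(request_id):
    # Stand-in for a call into a shared inference engine; a real engine
    # would run a forward pass here.
    return request_id * 2


# Four concurrent "streams" draining a queue of eight requests.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(infer, range(8)))

print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

A production scheduler additionally pins each stream's work to specific cores and memory nodes (NUMA awareness) so that a stream's working set stays in the caches closest to the cores executing it.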

Start inferencing with DeepSparse on your own infrastructure today.