Our Deep Sparse Technology


Today’s standard layer-after-layer execution of neural networks works well with GPUs and other hardware accelerators by utilizing their thousands of tiny synchronous cores across each layer. It has, however, been unsuccessful at delivering performance on CPUs. This is because CPUs typically have less overall compute, and repeatedly reading and writing complete layers to memory is throttled by the CPU’s lower memory bandwidth. This approach also makes no use of the CPU’s advantages: powerful asynchronous cores with large caches.

Hardware Accelerators — 1000s of weak synchronous cores with tiny caches and high memory bandwidth.

CPU — 10s of powerful asynchronous cores with large caches and low memory bandwidth.


At Neural Magic, we deliver GPU-class performance on CPUs by leveraging two properties of actual brains:

  • Their connectivity graph is extremely sparse, and
  • their computation is highly localized: when a neuron fires, the receiving neuron is right next to it, so there is little communication overhead.

 
Neural Magic has mimicked these two natural brain properties in its Deep Sparse execution technology:

Hardware Accelerators — Standard execution, synchronously layer by layer to maximize parallelism, reading and writing to memory using high bandwidth.

CPU — Standard execution works poorly: the compute isn’t there, and reading and writing layers to memory performs badly because of the CPU’s low bandwidth. Instead, we deeply sparsify the network to reduce compute, and then execute it depth-wise, asynchronously, and fully inside the large CPU caches.


Sparsification through pruning is a broadly studied ML technique that can cut the size and theoretical compute of a neural network by 10x or more without losing much accuracy. Our Sparsify and SparseML tools make it easy to reach industry-leading levels of sparsity while preserving baseline accuracy.
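As a concrete illustration (a minimal sketch in plain NumPy, not our Sparsify or SparseML recipes), one-shot magnitude pruning zeroes a layer’s smallest weights; at 90% sparsity the remaining multiply-adds drop roughly 10x, and in practice the surviving weights are fine-tuned to recover accuracy:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (one-shot pruning sketch)."""
    k = int(weights.size * sparsity)  # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Hypothetical 512x512 fully connected layer.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512))
w_sparse = magnitude_prune(w, sparsity=0.9)

kept = np.count_nonzero(w_sparse)
print(f"nonzero weights: {kept}/{w.size} "
      f"(~{w.size / kept:.1f}x fewer multiply-adds in theory)")
```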

In addition, our DeepSparse Engine’s breakthrough sparse kernels execute this sparse computation effectively. Deeply sparsified computation is memory bound, which would ordinarily play against the CPU’s low memory bandwidth. Our solution is to execute the neural network depth-wise rather than layer after layer. It might seem like magic, but we are able to break the network into Tensor Columns: vertical stripes of computation that fit completely in cache, without having to read or write to memory. Tensor Columns mimic the locality of the brain using the locality of reference of the CPU’s cache hierarchy: the outputs of a column’s small section of a layer of neurons wait in cache for the next layer as the sparse execution unfolds depth-wise.
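To make the depth-wise idea tangible, here is a toy sketch (a deliberately simplified illustration, not the DeepSparse Engine’s actual kernels) that pushes one narrow stripe of the input through a whole stack of 1D convolutions before moving to the next stripe; each stripe’s intermediates stay tiny, which is what lets a Tensor Column live in cache, and the final result matches ordinary layer-after-layer execution:

```python
import numpy as np

def conv1d_valid(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Single-channel 1D convolution with no padding ('valid')."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

def run_layer_by_layer(x, weights):
    """Standard execution: each layer's full output is materialized before the next."""
    for w in weights:
        x = conv1d_valid(x, w)
    return x

def run_depth_wise(x, weights, stripe=16):
    """Tensor-Column-style execution: a narrow stripe of the input (plus a small
    halo) is carried through *all* layers before moving on, so the intermediate
    activations never grow beyond the stripe and can stay resident in cache."""
    k = len(weights[0])
    halo = (k - 1) * len(weights)              # extra input each stripe needs
    out_len = len(x) - halo
    out = np.empty(out_len)
    for start in range(0, out_len, stripe):
        width = min(stripe, out_len - start)
        col = x[start:start + width + halo]    # small working set
        for w in weights:                      # execute depth-wise
            col = conv1d_valid(col, w)
        out[start:start + width] = col
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
weights = [rng.standard_normal(3) for _ in range(4)]  # four conv layers, kernel size 3

assert np.allclose(run_layer_by_layer(x, weights), run_depth_wise(x, weights))
```

In the real engine the stripes are multi-dimensional, sparse, and scheduled asynchronously across cores, but the working-set argument is the same: the column’s intermediates never leave cache.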

Sparse computation, executed depth-wise in cache, allows us to deliver GPU-class performance on CPUs.