A Software Architecture for the Future of ML

Today’s ML hardware accelerators, whether implemented in silicon or, more recently, even in light, are headed toward chips that apply a petaflop of compute to a cell phone’s worth of memory. Our brains, on the other hand, are biologically the equivalent of applying a cell phone’s worth of compute to a petabyte of memory¹. In this sense, the direction being taken by hardware designers is the opposite of the one proven out by nature. Why? Simply because we don’t know the algorithms nature uses.

But here is what we do know about computation in neural tissue: 

  1. Its connectivity graph is extremely sparse.
  2. Its computation has locality in data transfers: when a neuron fires, the receiving neurons are physically nearby, so communication overhead is low.
  3. It does not compute on the full graph every time, but rather in select regions.
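These three properties can be made concrete with a toy sketch. This is an illustration only, not a biological model; all names and numbers below are invented:

```python
import numpy as np

# A "brain-like" network as a sparse adjacency structure: each neuron
# connects to only a handful of nearby neurons (sparsity + locality),
# and each step of activity touches only a small region of the graph.
rng = np.random.default_rng(0)
n_neurons = 10_000
fanout = 8  # each neuron connects to just 8 others: extreme sparsity

# connectivity: neuron i -> a few neurons with nearby indices
targets = {
    i: ((i + 1 + rng.integers(0, 50, size=fanout)) % n_neurons)
    for i in range(n_neurons)
}

def step(active):
    """Propagate activity only from currently active neurons."""
    nxt = set()
    for i in active:
        nxt.update(targets[i].tolist())
    return nxt

active = {0}            # start activity in one region of the graph
for _ in range(3):
    active = step(active)

# After 3 steps, far fewer than n_neurons have been touched: the
# computation never visits the full graph.
print(len(active), "of", n_neurons, "neurons ever became active")
```

Because the fan-out is 8, three steps can reach at most a few hundred neurons; the rest of the graph is never read, written, or computed on.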

GPUs bring data in and out quickly, but have little locality of reference because of their small caches. They are geared towards applying a lot of compute to little data, not little compute to a lot of data. The networks designed to run on them therefore execute full layer after full layer in order to saturate their computational pipeline (see Figure 1 below). In order to deal with large models, given their small memory size (tens of gigabytes), GPUs are grouped together and models are distributed across them, creating a complex and painful software stack, complicated by the need to deal with many levels of communication and synchronization among separate machines.
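The full-layer execution pattern described above can be sketched as follows. This is a minimal illustration with invented sizes, not a model of any real GPU pipeline:

```python
import numpy as np

# Standard layer-after-layer execution: every layer's complete activation
# tensor is materialized (read from and written back to memory) before
# the next layer begins. Sizes are invented for illustration.
rng = np.random.default_rng(0)
layer_widths = [4096, 4096, 4096, 4096]
weights = [rng.standard_normal((n_out, n_in)) * 0.01
           for n_in, n_out in zip(layer_widths, layer_widths[1:])]

x = rng.standard_normal(layer_widths[0])
bytes_moved = x.nbytes                 # the input is read in
for W in weights:
    x = np.maximum(W @ x, 0.0)         # the full layer is computed...
    bytes_moved += x.nbytes            # ...and its full output written out
print(f"activation bytes moved through memory: {bytes_moved}")
```

Every layer boundary forces a full round trip of activations through memory, which is exactly the traffic pattern that high GPU bandwidth is built to absorb.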

[Figure 1]

CPUs, on the other hand, have larger, much faster caches than GPUs, and an abundance of memory (terabytes). A typical CPU server can have memory equivalent to that of tens or even hundreds of GPUs. CPUs are perfect for a brain-like ML world in which parts of an extremely large network are executed piecemeal, as needed.

The problem is that today’s neural network software architectures do not fit CPUs (see Figure 2 below). The standard layer-after-layer execution of neural networks works well with GPUs and other hardware accelerators by fully utilizing their thousands of tiny synchronous cores across each layer. It is, however, unsuccessful at delivering performance on CPUs. This is because CPUs typically have less raw compute, and repeatedly reading and writing complete layers to memory is throttled by the CPU’s lower memory bandwidth. This approach also makes no use of the CPU’s advantages: powerful asynchronous cores with larger and much faster caches.

[Figure 2]

The Revolutionary Deep Sparse Software Architecture 

Neural Magic’s Deep Sparse architecture (see right-hand side of Figure 2 above) is designed to mimic, on commodity hardware, the way brains compute. It uses neural network sparsity combined with locality of communication by utilizing the CPU’s large fast caches and its very large memory. 

Sparsification through pruning is a broadly studied ML technique, allowing reductions of 10x or more in the size of a neural network and the theoretical compute needed to execute it, without losing much accuracy. So, while a GPU runs networks faster by applying more FLOPs, Neural Magic runs them faster by reducing the FLOPs that are necessary. Our Sparsify and SparseML tools make it easy to reach industry-leading levels of sparsity while preserving baseline accuracy, and the DeepSparse Engine’s breakthrough sparse kernels execute this computation effectively.
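As an illustration of the idea, unstructured magnitude pruning (one standard sparsification technique; this sketch is not the SparseML implementation) can zero out the bulk of a weight matrix:

```python
import numpy as np

# Magnitude pruning sketch: keep only the largest-magnitude weights.
# Removing 90% of the weights reduces the theoretical FLOPs needed for
# a matrix-vector product by roughly 10x.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))

sparsity = 0.90  # target: remove 90% of weights
threshold = np.quantile(np.abs(W), sparsity)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

kept = np.count_nonzero(W_pruned) / W.size
print(f"fraction of weights kept: {kept:.2f}")  # ~0.10
```

In practice pruning is done gradually during training or fine-tuning so the remaining weights can adapt, which is how accuracy is preserved at high sparsity.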

But once FLOPs are reduced, the computation becomes more “memory bound”: there is less compute per data item, so the cost of moving data in and out of memory becomes crucially important. Some hardware accelerator vendors propose data-flow architectures to overcome this problem. Neural Magic solves it on commodity CPU hardware by radically changing the neural network software execution architecture. Perhaps magically, rather than executing the network layer after layer, we execute it in depth-wise stripes we call Tensor Columns. Tensor Columns mimic the locality of the brain using the locality of reference of the CPU’s cache hierarchy. Each tensor column stays completely in the CPU’s large, fast caches along the full execution length of the neural network: within the column, the outputs of a small section of a layer of neurons wait in cache for the next layer as the sparse execution unfolds depth-wise. In this way, we almost completely avoid moving data in and out of memory.
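The depth-wise idea can be sketched on a toy stack of 1-D convolutions. This is an illustrative re-derivation of the scheduling principle, not the DeepSparse Engine; all names and sizes are invented:

```python
import numpy as np

# Depth-wise "column" execution for a stack of 1-D convolutions
# (kernel size 3). Instead of materializing every full layer, we push
# one small input stripe through ALL layers; only that stripe's working
# set is live at any time -- in a real system it would sit in cache.
rng = np.random.default_rng(0)
depth, width, k = 4, 64, 3
kernels = [rng.standard_normal(k) * 0.5 for _ in range(depth)]

def conv(x, w):
    return np.convolve(x, w, mode="valid")  # output shrinks by k - 1

x = rng.standard_normal(width)

# Reference: standard layer-after-layer execution (full layers in memory).
ref = x
for w in kernels:
    ref = conv(ref, w)

# Column-style execution: each output position depends only on a small
# input stripe (its receptive field), so that stripe can traverse the
# whole depth of the network without touching main memory.
halo = depth * (k - 1)                  # receptive-field growth over depth
out = np.empty(width - halo)
for i in range(out.size):
    col = x[i : i + halo + 1]           # small stripe, stays "in cache"
    for w in kernels:
        col = conv(col, w)              # unfolds depth-wise
    out[i] = col[0]

print(np.allclose(out, ref))  # both schedules compute the same result
```

The two schedules are numerically identical; what changes is the working set: the layer-wise schedule keeps whole layers live, while the column schedule keeps only a stripe of `halo + 1` values per position.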

The Deep Sparse software architecture allows Neural Magic to deliver GPU-class performance on CPUs. We can deliver neural network performance all the way from low-power, sparse, in-cache computation on mobile and edge devices to large-footprint models in the shared memory of multi-socket servers. The Neural Magic architecture uses the CPU’s general-purpose flexibility to open the field to the neural networks of tomorrow: enormous models that mimic associative memory by storing information without having to execute the full network, layer after layer, every time.

¹ Learn more here.


Figure 1:

GPU — 1000s of synchronous cores, tiny slow caches, but 10x higher memory bandwidth.

CPU — 10s of powerful asynchronous cores, lower memory bandwidth, but large, 10x faster caches.

Figure 2 (left to right):

GPU — Standard execution, synchronously layer by layer to maximize parallelism, reading and writing to memory using high bandwidth.

CPU — Standard execution works poorly: compute is lacking, and reading and writing layers to memory performs badly due to the CPU’s lower bandwidth.

CPU — Deeply sparsify the network to reduce compute…

CPU — …and execute it depth-wise, asynchronously, and fully inside the large, fast CPU caches.