Counterintuitive Lessons: How to Improve Machine Learning

I would like to put forth an idea that might seem counterintuitive given present-day hype: that human-scale machine learning infrastructure must more closely map to the human brain. In more technical terms, this means that it should be based on modifications to existing commodity von Neumann architecture CPUs, rather than on today’s popular “brain-inspired” massively parallel hardware accelerators.

Neural Tissue vs. Neuromorphic Hardware: What We Can Learn

The human cortex has something on the order of 30 billion neurons and 200 trillion synapses, so it is about one petabyte in size even with a minimal digital representation of its connectivity graph. ResNet-152, the star neural network model of the ImageNet classification benchmark, expands to about 10 gigabytes during training. Somewhere in this five-orders-of-magnitude gap is the true size machine learning models will have to reach in the coming decades to enable systems that deliver human-like data analysis capabilities.
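The petabyte estimate can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes that a minimal representation stores only a target-neuron index per synapse, rounded up to whole bytes; that per-synapse cost is an illustrative assumption, not a figure from this article.

```python
import math

NEURONS = 30e9                 # cortical neurons (figure from the text)
SYNAPSES = 200e12              # cortical synapses (figure from the text)
MODEL_TRAINING_BYTES = 10e9    # ~10 GB model during training (figure from the text)

# Assumption: each synapse stores only the index of the neuron it connects to.
bits_per_index = math.ceil(math.log2(NEURONS))      # 35 bits address 30 billion neurons
bytes_per_synapse = math.ceil(bits_per_index / 8)   # rounded up to 5 whole bytes

cortex_bytes = SYNAPSES * bytes_per_synapse
gap_orders = math.log10(cortex_bytes / MODEL_TRAINING_BYTES)

print(f"{cortex_bytes / 1e15:.1f} PB, gap of {gap_orders:.0f} orders of magnitude")
# prints "1.0 PB, gap of 5 orders of magnitude"
```

Storing synaptic weights as well would only make the graph larger, so one petabyte is best read as a lower bound under this assumption.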

The prevailing roadmap for building hardware systems to run such models is centered on adding massively parallel hardware accelerators specially designed for machine learning (ML), be they GPUs, TPUs, ASICs, or FPGAs, to all our computing platforms.

The rush to special hardware that has become so common is propelled by the belief that ML is a “throughput computing” problem that requires a special “neuromorphic” processing approach that mimics our brains: thousands of small cores, each receiving data at high bandwidth, executing a limited set of operations in parallel, and then moving the results onwards for further processing. 

This is a highly appealing mental picture that has propelled GPUs, TPUs, and similar specialized massively parallel hardware accelerators to the forefront of attention in the ML world. Here is a representative quote from tech writer Robert May:

“... you have massive innovation in neuromorphic chips ... will take a few years to really take hold, but they will break the frameworks people currently use to think about what products and services can be built with AI.  I’m particularly excited about the innovation that will come from breaking the x86 chip architecture framework we’ve relied on for so long.”

But can hardware-accelerated throughput computing actually deliver on its promise? What we desire is “Big ML”: big models, big data, and big precision. Yet the current trend with throughput computing accelerators like GPUs, TPUs, and ASICs seems to be the opposite: they provide more FLOPS at the cost of limited memory, small caches, and decreasing precision, which means they can only operate on small models and small data, at reduced 16-bit precision. These are design issues that stem from the limitations of moving data continuously among the memories of thousands of high-speed cores. Though much research and development is devoted to overcoming these limitations, the question is whether it is worth the effort.

Are X86 FLOPS Good Enough for Our Purposes?

Surprisingly, the performance gap between CPUs and GPUs is not as big as one might think. In terms of raw FLOPS, the most powerful enterprise Nvidia Volta GPU delivers about 15 TFLOPS of performance, versus about 3 TFLOPS for a commodity 18-core Intel i9 CPU. So the performance gap is 5x, not 50x or 500x as one might imagine, and the price gap is about 4x in favor of the CPU (which, as an added bonus, uses half the power). Moreover, FLOPS are only part of the story, and one can overcome this 5x gap with smarter algorithms.

At Neural Magic, even in our slowest vanilla execution mode, our software delivers the same performance as a 4000-core Nvidia Pascal GPU on a single 10-core Intel i9 desktop. The reason this can be done is a combination of factors:

  1. Amdahl’s Law: In any one of today’s neural networks (and we have no reason to believe future ones will be fundamentally different), a significant fraction of the execution is memory-bound (25% for AlexNet, 50-60% for ResNet). With 50% of the execution being memory-bound, even an endless number of FLOPS for the compute-intensive part will only get you a 2x speedup.
  2. Cache Size and Pre-fetching: Modern CPUs with, say, 20 cores have L2 and L3 caches that are an order of magnitude larger than what can be placed on a GPU with ~4000 cores. This allows for a more sophisticated use of mathematical transforms, like FFT and Winograd, whose use on a GPU is limited by its small caches. Moreover, with the right algorithmic restructuring of the neural network execution, a modern CPU can pre-fetch data into its large cache and run on it for long periods without ever moving data back to memory.
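The 2x ceiling in the first point follows directly from Amdahl’s Law. As a minimal sketch, assuming that the memory-bound share of the runtime gets no benefit at all from extra FLOPS (the function names here are illustrative):

```python
import math


def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the runtime runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on the speedup as the acceleration factor grows without limit."""
    if accelerated_fraction == 1.0:
        return math.inf
    return 1.0 / (1.0 - accelerated_fraction)


# 50% memory-bound: only the other 50% can be accelerated.
print(amdahl_limit(0.5))                    # 2.0
# 25% memory-bound: 75% can be accelerated.
print(amdahl_limit(0.75))                   # 4.0
# A 5x FLOPS advantage applied to a 50% compute-bound workload.
print(round(amdahl_speedup(0.5, 5.0), 2))   # 1.67
```

Under this assumption, a 5x advantage in raw FLOPS yields well under 2x end to end on a network that is half memory-bound.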

In short, executing neural networks is not about raw FLOPS. It’s a systems problem like any other, and cache size and memory speed (which favor the CPU) are as critical as the overall FLOP count (which favors the GPU and other specialized hardware).

The Silicon Brain of the Future

The biggest question, then, is whether, when we translate the computations in neural tissue to silicon, we need to mimic the morphology and distributed dynamics of human brain tissue. To my mind, the biggest mistake currently being made by the “neuromorphic computing” community is working under the assumption that the brain is a “throughput computer” like a GPU. The reality is that if all your brain’s neurons were firing at once, as in a GPU, you could fry an egg on your head (not to mention that you would be having a major epileptic fit).

What our brain most likely does (and I say “most likely” because research on this varies) is fire just a fraction of our total neurons at one time, but in a parallel way. This is the only way a set of slow biological and chemical computing units can execute a billion instructions per second. But a modern CPU, executing in silicon, delivers tens of billions of instructions per second without the need for this same “neuromorphic” parallelism. With tens of billions of instructions per second at our disposal on a CPU, we can spend them however we wish to emulate the same neural computation. In other words, our silicon hardware does not need to look like the biological one.

As for the “program” being executed (that is, the AI program described by the neurons in our brain), it is written as a large graph where each neuron is both a memory (synapses) and a computing device (action potential). This, again, is an artifact of the chemical and biological infrastructure nature is writing the program in. If we want to represent this program in its entirety digitally, we will need a graph that takes a petabyte of storage to represent. It is highly unlikely that we will be able to create a full silicon circuit that mimics this graph anytime in the next few decades: ASIC size is far from the scale such a representation would require.

The neural programs themselves are changing rapidly from month to month. Right now, we require tera-FLOPS of compute to run the convolutions at the base of our artificial neural networks.

It’s worth noting that the human brain does not seem to function like a deep neural net. We can recognize Mount Fuji in a photo in less than 1/10th of a second, which, at roughly 5 milliseconds per firing, amounts to fewer than 20 neurons firing in sequence. This would imply a very wide and shallow sub-network that looks nothing like the artificial ones we are basing our architectural designs on. Perhaps it would be wiser to avoid committing to a new architecture now and instead add features to our existing, well-tested ones?

Now back to the large neural program graph. The only feasible way to store the descriptions of large-scale neural computations (on the order of 100 billion parameters) is to keep their graph in a CPU’s large DRAM memory. We can make tens of terabytes of DRAM accessible in a shared way on a commodity multi-core CPU server. This is enough to hold sufficiently large graphs even today. Contrast that with the specialized hardware accelerators that have to use expensive high bandwidth memory (HBM) to allow access by thousands of cores. Currently, the use of HBM restricts GPUs and TPUs to only tens of gigabytes of memory: about 1,000 times less than that of CPUs. Note that adding DRAM memory to GPUs/TPUs via a CPU across an interconnect is not going to be fast enough anytime soon, because a network being executed or trained must fit in memory.
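To make the capacity argument concrete, here is a small sketch that checks whether a model of that size fits in each kind of memory. The 4-byte (32-bit) weights and the specific 20 TB and 20 GB capacities are illustrative assumptions standing in for “tens of terabytes” and “tens of gigabytes”; they are not measurements of any particular machine.

```python
GB = 10**9
TB = 10**12


def footprint_bytes(parameters: int, bytes_per_parameter: int = 4, copies: int = 1) -> int:
    """Bytes needed to hold a model; `copies` > 1 crudely models extra training state."""
    return parameters * bytes_per_parameter * copies


def fits(footprint: int, capacity: int) -> bool:
    """True if the whole footprint can be resident in a memory of `capacity` bytes."""
    return footprint <= capacity


PARAMETERS = 100 * 10**9     # "order of 100 billion parameters" (from the text)
DRAM_CAPACITY = 20 * TB      # assumed stand-in for "tens of terabytes"
HBM_CAPACITY = 20 * GB       # assumed stand-in for "tens of gigabytes"

weights = footprint_bytes(PARAMETERS)   # 32-bit weights only, no training state

print(weights // GB, "GB of weights")                    # 400 GB of weights
print("fits in DRAM:", fits(weights, DRAM_CAPACITY))     # fits in DRAM: True
print("fits in HBM:", fits(weights, HBM_CAPACITY))       # fits in HBM: False
print("capacity ratio:", DRAM_CAPACITY // HBM_CAPACITY)  # capacity ratio: 1000
```

Even before counting gradients or activations, the weights alone exceed the assumed HBM capacity by a factor of 20, while occupying a small fraction of the assumed DRAM.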

In Summary: Why Our Best Path Forward is CPUs and Pre-Fetched Memory

So if we place a large graph of a neural circuit in shared CPU DRAM, then the main limitation on performance will not be the lack of compute but rather the time to bring the relevant parts of the graph from memory to the cores so they can be executed. If we want to avoid a slowdown, we need to run in cache, which implies having the relevant parts of the graph pre-fetched into the cores’ cache memory in advance.

In other words, to execute such large graphs, we should push forward the design of CPUs with ever larger caches and the right pre-fetching mechanisms. Present-day pre-fetching is “vector-based”: the next data to be used is assumed to be in the next consecutive cache line. In the future, pre-fetching could be combinatorial, based on graph layouts, or, more interestingly, learned at execution time.
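The difference between next-line and layout-based pre-fetching can be illustrated with a toy simulation. This is a sketch only: it models a tiny LRU cache of cache lines, ignores timing and bandwidth entirely, and uses a made-up graph in which node i occupies cache line i and feeds node (i + 17) mod 64. None of these names or parameters describe a real CPU pre-fetcher.

```python
from collections import OrderedDict


class LineCache:
    """Tiny LRU cache that only tracks which cache lines are resident."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, line: int) -> None:
        """Load or refresh `line`, evicting the least recently used line if full."""
        if line in self.lines:
            self.lines.move_to_end(line)
        else:
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)

    def __contains__(self, line: int) -> bool:
        return line in self.lines


def count_misses(access_order, predict, capacity: int = 8) -> int:
    """Replay the accesses; after each one, pre-fetch the lines `predict` returns."""
    cache = LineCache(capacity)
    misses = 0
    for line in access_order:
        if line not in cache:
            misses += 1
        cache.touch(line)
        for guess in predict(line):
            cache.touch(guess)
    return misses


# Hypothetical graph: node i lives in cache line i and feeds node (i + 17) % 64.
SUCCESSORS = {i: [(i + 17) % 64] for i in range(64)}


def next_line(line: int):
    """The "vector-based" guess: the next consecutive cache line."""
    return [line + 1]


def graph_successors(line: int):
    """A layout-based guess: the lines holding the node's successors."""
    return SUCCESSORS.get(line, [])


def follow_chain(start: int, steps: int):
    """Visit nodes by following each node's first outgoing edge."""
    order, node = [], start
    for _ in range(steps):
        order.append(node)
        node = SUCCESSORS[node][0]
    return order


order = follow_chain(0, 64)
print(count_misses(order, next_line))            # 64: every access misses
print(count_misses(order, graph_successors))     # 1: only the first access misses
print(count_misses(list(range(64)), next_line))  # 1: next-line works for sequential data
```

In this idealized setting the next-line guess is perfect for sequential, vector-like accesses and useless for the graph traversal, while a pre-fetcher that knows the graph layout removes all but the first miss. Real hardware would fall somewhere in between, since pre-fetches take time to arrive.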

I believe that the nature of the neural emulation problem, namely the need for large representations that rely more on data movement than on compute, implies that, rather than develop specialized neuromorphic “throughput” architectures with small caches and small memory sizes (ultimately restricted by the communication needed to allow access by thousands of cores), we should progress by evolving our existing, well-understood von Neumann CPU architecture and its large, pre-fetchable DRAM memory.

This is the best path forward for machine learning.