Challenging Memory Requirements and Performance Standards in ML

Photo Credit: Thomas Kelley (via Unsplash)

Everything we know about memory requirements in machine learning may be wrong. 

Today, when data scientists train deep learning models on a “throughput computing” device like a GPU, TPU, or similar hardware accelerator, they are often forced to shrink their model or input size to fit within the device’s memory limits. Training a large, deep neural network (or even a wide, shallow one) on a single GPU may, in many cases, be impossible.

Ever wonder why ResNet-152, the winner of the ILSVRC-2015 competition, had 152 layers and not 153? Is it a coincidence that training the 152-layer model has a memory footprint of slightly less than 12GB, while 153 layers would push beyond 12GB (the standard size of GPU memory at the time)?

As NVIDIA’s Chief Scientist Bill Dally admits, deep learning remains “completely gated by hardware.” If the images in ImageNet were larger than 224x224, then one would have to reduce the batch size, cut the number of ResNet layers, or shrink each high-resolution photograph back down to 224x224. Any one of these options would most likely cost prediction accuracy.
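To see why larger images force these tradeoffs, it helps to run the back-of-the-envelope arithmetic. The sketch below is illustrative only: the 60M parameter count is a rough public figure for ResNet-152, and `activation_floats_per_pixel` is a made-up constant standing in for the total feature-map storage that backpropagation requires, which varies by architecture.

```python
BYTES_PER_FLOAT32 = 4

def rough_training_memory_gb(num_params, batch_size, height, width,
                             activation_floats_per_pixel=1200):
    """Very rough GPU memory estimate for training a convolutional net."""
    # Weights + gradients + one optimizer slot (e.g. SGD momentum).
    model_bytes = 3 * num_params * BYTES_PER_FLOAT32
    # Activations stored for backprop grow linearly with batch size
    # and with image area.
    activation_bytes = (batch_size * height * width *
                        activation_floats_per_pixel * BYTES_PER_FLOAT32)
    return (model_bytes + activation_bytes) / 1e9

# ResNet-152 has roughly 60M parameters.
print(rough_training_memory_gb(60e6, batch_size=32, height=224, width=224))

# Doubling the image side quadruples activation memory, so the batch
# size must shrink by 4x just to stay inside the same device.
print(rough_training_memory_gb(60e6, batch_size=8, height=448, width=448))
```

Under these hypothetical constants, quadrupling the pixel count while quartering the batch keeps activation memory constant, which is exactly the tradeoff described above.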

Today’s massive GPU and TPU compute pods generate tens or even hundreds of teraflops of performance, but they do not really solve this memory problem. When a DGX-2 is said to have 500GB of high-bandwidth memory (HBM), that memory is actually split across 16 GPUs. During training, a copy of the model must fit in the 32GB of expensive HBM on each GPU (model-parallel training across GPUs is a myth). The same goes for the Google TPU2 Pod, advertised as having 4TB of shared HBM. What this actually means is that each of the 256 TPU2 chips has 16GB of memory, and the model and its parameters must fit in those 16GB during training -- or bye-bye speedup. So, on a DGX-2 or a TPU pod, one can train ResNet-152 on ImageNet, but still not ResNet-200. One can run 224x224 images, but definitely not 4K images.
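The constraint described above comes from data-parallel training: each device holds a full replica of the model, and only the data is split. A minimal sketch of one data-parallel SGD step, using a toy least-squares model in plain NumPy (the worker count, shard size, and learning rate are arbitrary choices for illustration):

```python
import numpy as np

def local_gradient(w, x, y):
    # Least-squares gradient computed on this worker's shard only.
    return 2 * x.T @ (x @ w - y) / len(y)

def data_parallel_step(replicas, shards, lr=0.05):
    # Every worker holds a FULL copy of the weights: the pod's memory
    # is split across devices, but each device must still fit the
    # entire model. Only the data is divided.
    grads = [local_gradient(w, x, y) for w, (x, y) in zip(replicas, shards)]
    avg = sum(grads) / len(grads)            # the "all-reduce" step
    return [w - lr * avg for w in replicas]  # replicas stay identical

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
replicas = [np.zeros(2) for _ in range(4)]   # 4 "devices", 4 full copies
for _ in range(200):
    shards = []
    for _ in range(4):
        x = rng.normal(size=(16, 2))
        shards.append((x, x @ w_true))       # noiseless labels
    replicas = data_parallel_step(replicas, shards)
```

Every replica converges to the same weights; the memory cost of the model is paid once per device, not once per pod.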

In other words, as our inputs get larger, as we add video, and as we aim to train deeper and larger models to accurately analyze them, the memory limitations of our dedicated accelerator hardware devices will continue to hinder discovery and innovation.

Why would we allow these limitations to prevent us from developing a ResNet-500 (and beyond)?

The High Throughput Conundrum

There’s one big reason memory requirements are failing us: our industry is obsessed with the idea that deep learning hardware, like the brain, is about throughput computing, where more compute power equals better performance. Measuring performance in FLOPS (floating-point operations per second) has become the industry standard, and the accelerators that generate these FLOPS run thousands of parallel compute cores -- which require expensive, size-limited HBM.

In reality, we may be looking at the wrong paradigm and the wrong measure of success for “brain-like” deep learning hardware. Our brains are not actually throughput devices. Their computations are sparse. Their so-called “massive parallelism” is most likely an artifact of how their program is written in a slow bio-chemical medium rather than the fast silicon circuits available to us.

There is nothing inherent about how brains compute, except perhaps the underlying algorithms. Because we have not figured out what these algorithms are, we are simply throwing FLOPS at inefficient versions of them (we know today’s models are inefficient because we can prune them heavily and reduce the precision of their operations while still getting the same accuracy). There is an abundance of evidence suggesting that if we can figure out these algorithms, we will not need so many FLOPS.
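The pruning claim above is easy to demonstrate in miniature. The sketch below is not any particular published pruning method, just magnitude pruning on a toy linear model whose dimensions and sparsity level are arbitrary: fit dense weights, zero out the 80% with the smallest magnitudes, and compare the error.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 500, 50
# Only 5 of the 50 features matter, mimicking the redundancy found in
# over-parameterized networks.
w_true = np.zeros(d)
w_true[:5] = 3 * rng.normal(size=5)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w_dense = np.linalg.lstsq(X, y, rcond=None)[0]  # dense "trained" weights

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

# Magnitude pruning: zero the 80% of weights closest to zero.
threshold = np.quantile(np.abs(w_dense), 0.8)
w_pruned = np.where(np.abs(w_dense) >= threshold, w_dense, 0.0)

print(mse(w_dense), mse(w_pruned))  # error barely moves at 80% sparsity
```

Four out of five weights are gone, yet the fit is essentially unchanged: the dense model was spending most of its capacity on features that carry no signal.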

A New Logic: Forget the FLOPS and Focus on the Algorithms 

Changing the current industry logic around the premise that throughput = performance won’t be easy, but other industries have faced similar challenges, where more raw power (or more of the same power metric) wasn’t necessarily the best approach. As we’ll see in the examples below, perhaps the solution is to approach performance in a different way altogether.

Light Bulbs: More Watts vs. More Efficient Bulbs. For decades, the accepted measure of a light bulb’s output was wattage. It was commonly accepted that the higher the wattage of an incandescent bulb, the brighter and more powerful it would be, completely ignoring the inefficiencies (yikes, that’s hot!) that came along for the ride. With the advent of more efficient LED lighting, however, a lower-wattage bulb can match the performance of its higher-wattage incandescent counterparts. If the solution had remained to build higher- and higher-wattage bulbs, we’d probably have set the world on fire (literally).

Vehicles: More MPG vs. A Different Type of Engine. In vehicles, engine horsepower and miles per gallon (MPG, or range) were the typical industry benchmarks for performance. Consumers faced a tradeoff between horsepower and MPG, depending on what they wanted most in a car. To this day, consumers and vehicle manufacturers struggle with this tradeoff, with external factors such as government regulation and oil prices exerting influence on the discussion.

However, as anyone who has watched the Tesla drag-racing videos (don’t do this at home, kids) can attest, this is a completely false tradeoff based on the wrong measurements. Hybrid and electric car manufacturers like Tesla have challenged the prevailing vehicle performance paradigm, demonstrating that vehicles can achieve similar range through different power sources, and in some cases can do so without sacrificing horsepower.

In both of these examples, the industry refused to accept that the current paradigm was the only path forward. To accelerate the field of machine learning, we must be willing to test the assumption that more throughput, measured in FLOPS, is the only way for data scientists to achieve performance with bigger models or bigger inputs.

What if we could achieve similar performance without compromising size, by making better use of the compute resources we already have? Why wouldn’t we choose to challenge the idea that more FLOPS are better?