Machine Learning Inference: Why Use GPUs?


Or other domain-specific chipsets, for that matter?

In the machine learning inference phase, training is complete and it’s time for a model to do its job: make predictions based on the incoming data. In other words, the model has learned all of the “assumptions” it needs to know to make predictions for the task at hand, whether it’s image recognition, recommendation engines, or choosing which video game move to make next (just as a few examples).
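At its core, inference is just a forward pass: new data flows through parameters that are already frozen, with no gradients and no weight updates. The tiny sketch below illustrates the idea with hypothetical, hand-picked weights (not from any real training run):

```python
import numpy as np

# Toy "trained" model: after training, the weights are frozen. Inference
# is simply applying them to incoming data. (Weights here are hypothetical,
# chosen purely for illustration.)
weights = np.array([0.8, -0.4, 0.2])
bias = 0.1

def predict(x):
    """Run inference: a forward pass with learned parameters, no updates."""
    logit = x @ weights + bias
    prob = 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> probability
    return int(prob > 0.5)

# New, unseen inputs: the model only makes predictions now.
samples = np.array([[1.0, 0.5, 2.0], [-1.0, 2.0, 0.0]])
predictions = [predict(s) for s in samples]  # -> [1, 0]
```

The same shape holds for image recognition or recommendation models; only the size of the forward pass changes.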

Many analyst firms define the market for deep learning inference based on chipset growth, since AI acceleration has been largely hardware dominant. For example, according to Tractica Research, the market for inference chipsets is expected to reach $52 billion by 2025, out of a total chipset market of $66.3 billion (or 79% of the total market). According to Intel’s Vice President of AI Products Group, Gadi Singer, there will be “a clear shift in the ratio between cycles of training and inference from 1:1 in the early days of deep learning, to a projected ratio of well over 1:5, favoring inference, by 2020.”


In other words, machine learning is on the cusp of a new phase of maturity where it’s time to stop training and get to work.


Critical Decision Criteria for Machine Learning Inference

Some of the biggest decision criteria in the machine learning inference phase are the speed, efficiency, and accuracy of these predictions. If a model can’t process data fast enough, it becomes a theoretical exercise that can’t be used effectively in the real world. If it consumes too much energy, it becomes too costly to operate in production. And finally, if its accuracy is poor, a data science team can’t justify the model’s continued use.
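Whether inference speed is actually a bottleneck is an empirical question, so it pays to measure before buying hardware. The sketch below shows one rough way to do it; the `infer_fn` stand-in and the warmup/run counts are illustrative assumptions, not a prescribed benchmark:

```python
import time
import statistics

def measure_latency(infer_fn, sample, warmup=10, runs=100):
    """Rough per-prediction latency estimate for an inference callable.
    `infer_fn` is whatever invokes your model; it is a stand-in here."""
    for _ in range(warmup):            # warm caches/JIT before timing
        infer_fn(sample)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_fn(sample)
        times.append(time.perf_counter() - start)
    return statistics.median(times)    # median resists outlier spikes

# Example with a trivial stand-in "model":
latency = measure_latency(lambda x: sum(x), [1.0] * 1000)
```

Comparing this number against the application’s latency budget (e.g. frame rate for video, response time for a web request) tells you whether a CPU is already fast enough.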

Inference speed, in particular, can become a bottleneck in certain scenarios, including:

  • Image Classification, used in many applications including social media and image search engines. Even though these tasks are relatively simple, speed is critical, particularly in matters of public safety or platform violations.
  • Large Image Processing, such as high-resolution video processing. Edge computing and real-time examples include self-driving cars, recommendations on commerce sites, and real-time internet traffic routing.
  • High Volumes of Images and Videos, including object identification within 24x7 video streams. For example, Tractica Research estimates that surveillance videos will be among the top 10 applications of computer vision technology by 2025.
  • Complex Images or Tasks, for example pathology and medical imagery. These are among the most complex images to analyze. Today, data scientists must fragment images into smaller tiles to gain incremental speed or accuracy advantages from a GPU. These scenarios require both a reduction in inference time and an increase in accuracy.
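The tiling workaround mentioned above can be sketched in a few lines. The slide dimensions and tile size below are hypothetical; real pathology pipelines add overlap, stitching, and border handling that this sketch omits:

```python
import numpy as np

def tile_image(image, tile_size):
    """Split a large image into square tiles so each one fits the
    accelerator's preferred input size. (Simplified: no overlap, and
    assumes dimensions divide evenly by tile_size.)"""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tiles.append(image[y:y + tile_size, x:x + tile_size])
    return tiles

# A hypothetical 1024x1024 slide split into 256x256 tiles -> 16 tiles,
# each of which must be run through the model separately.
slide = np.zeros((1024, 1024, 3), dtype=np.uint8)
tiles = tile_image(slide, 256)
```

Each tile becomes a separate inference call, which is exactly why this workaround multiplies the pressure on inference speed.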

Many data scientists who are working in these scenarios may start with CPUs, since inference is typically not as resource-heavy as training. As inference speed becomes a bottleneck, some may resort to using GPUs or other specialty hardware to achieve the performance or accuracy gains they require.


Incremental Progress and Tradeoff Issues

Even though machine learning has experienced incremental gains in performance with GPUs, to make the next great leap in capability, bigger models must be processed faster and with a higher degree of accuracy. Neural networks and the data they run on will only get bigger with time. For example, according to a recent talk from NVIDIA’s Chief Scientist Bill Dally, ImageNet is now considered a small dataset and some cloud data centers train on more than one billion images, using upwards of 1,000 GPUs. Microsoft’s ResNet-50 neural network requires 7.72 billion operations to process one low-resolution (225x225) image.

For many organizations, domain-specific hardware options for machine learning inference tend to be too costly to put these types of models into production at scale. In order to cut costs, the value of the model itself is often degraded to make it production-ready. Even if an organization can swallow the enormous costs, it is limited to deploying these models in specialized data centers, or on expensive resources in the cloud. In other words, it cannot build the models into low-power mobile devices, edge computing systems, or systems with intermittent connectivity (or systems that are disconnected altogether).

Depending on the application, batch size is a variable that impacts the speed of inference. Much domain-specific hardware, GPUs included, is performance-optimized for batch sizes of 64 or greater, which is great for real-time analysis of batched data (e.g. social media images or voice data streams) but not ideal for situations where teams must wait to assemble enough images to fill a batch (e.g. medical image scans). The ability to run batch size one without sacrificing speed and accuracy means more (and faster) predictions.

Unlock Machine Learning Inference from the Chipset

Domain-specific hardware chipsets, including GPUs, FPGAs, TPUs and more, have to date been the proverbial hammer that makes every problem look like a nail. Many organizations don’t consider CPUs a viable option because they haven’t performed to the level of domain-specific hardware. It doesn’t have to be this way if CPUs can achieve both performance and value in the inference phase.

The next great “unlock” for machine learning will be unlocking inference from the chipset, letting teams use the massive computational power available in the ubiquitous hardware they already have — without making expensive, dedicated investments in technology that only provides incremental improvements.

Note: A version of this article originally ran on our Medium publication. Follow along @LimitlessAI.


Neural Magic is powering bigger inputs, bigger models, and better predictions. The company’s software lets machine learning teams run deep learning models at GPU speeds or better on commodity CPU hardware, at a fraction of the cost. To learn more, visit