GPU Speeds without GPUs: Announcing the Neural Magic Inference Engine

neuralmagic
November 7, 2019

This is big. After months of effort from our engineering team, we are proud to announce the first version of the Neural Magic Inference Engine, offering GPU-class performance on commodity CPUs for computer vision and recommendation use cases. 

We founded Neural Magic in 2018 to approach the computational challenges of deep learning without the need for costly GPUs or other hardware accelerators. Our software makes it possible to run deep learning models exceptionally fast on general purpose Intel CPUs, making AI innovation more affordable, widely accessible and flexible for every data science team.

No Hardware AI

The Neural Magic Inference Engine lets data scientists take advantage of the abundant, available compute resources they already have, rather than invest in expensive, specialized AI hardware. We take advantage of the natural sparsity and unique structure of deep learning models to deliver breakthrough performance without sacrificing accuracy, eliminating the crucial tradeoff for data scientists. 

The Neural Magic Inference Engine is a pure software runtime that is downloaded and installed in user space. It fits seamlessly into existing CI/CD pipelines, can be deployed in containers or virtual machines, and can be managed with Kubernetes like any modern software application. Using Neural Magic, machine learning engineers can deploy and scale out deep learning applications quickly and easily. Today we offer support for recommendation networks, including DLRM, and image classification networks, such as ResNet-50 and MobileNet. Support for additional computer vision use cases, including object detection and image segmentation, is on our roadmap.

The Neural Magic Inference Engine offers machine learning teams:

  • Lower costs, up to 10x in some instances, by running deep learning models at scale on commodity CPU resources
  • GPU-class performance without sacrificing accuracy, by rethinking how convolutional neural networks can be executed more efficiently
  • Unmatched flexibility of a software solution that works with the tools you already use and can be deployed where you need it – on-premise, in the cloud, or at the edge

A different approach

We are a team of MIT computer scientists who have been studying multicore processing and machine learning for years. Our breakthrough came during our connectomics research in 2017. Rather than ship massive amounts of electron microscopy data (0.5 terabytes an hour) back and forth to cloud GPUs to map neural pathways in brain tissue, we discovered a way to restructure neural networks to run exceptionally fast on the multicore server in our lab. Our results demonstrated to us that high performance execution of deep learning models is, fundamentally, an algorithms and systems engineering problem.

Not surprisingly, this is a scenario that has played out over and over again in the history of computing. A new computational need arises and the market runs towards specialized hardware and chip designs to address these “new” needs. But, ultimately, software solutions running on general purpose compute resources unlock much larger opportunities and win the day.

Sign up for Early Access to the Neural Magic Inference Engine 

Today, Neural Magic works with image classification and fully connected networks. The software works on ONNX files exported from the leading machine learning frameworks – including PyTorch, TensorFlow, and Caffe.

Sign up for early access on Neural Magic’s homepage and see for yourself how Neural Magic delivers GPU-class performance on general purpose, commodity CPUs.