The combination of the right software and commodity hardware will prove capable of handling most machine learning tasks

Earlier this year, Nir Shavit, professor of EECS at MIT and CEO of Neural Magic, joined Ben Lorica for an open discussion on the Data Exchange Podcast. The conversation spanned multicore software, neurobiology and deep learning.

The full episode can be downloaded from iTunes, Android, Spotify, Stitcher, Google, and RSS.

The full transcript follows, lightly edited for clarity.

Ben Lorica: Nir Shavit, professor at MIT, but also co-founder and CEO of neuralmagic.com. Welcome to The Data Exchange podcast.

Nir Shavit: Thank you.

Ben: Let’s start with your research in neurobiology. What is the connection between this line of research and AI in industry?

Nir: In recent years, I’ve been at MIT working in a field called connectomics. In connectomics, we take tiny slivers of brain, the size of a grain of salt, from mammals, like mice. Our colleagues at Harvard slice these tiny grains of salt 30,000 times, then image that with an electron microscope. A cubic millimeter of mouse brain gives you about two petabytes of data, which you then run machine learning (ML) algorithms on to extract the connectivity of the tissue. So we’re learning what connectivity looks like in the brain. The thing I’ve learned in this research is that our brain essentially is an extremely sparse computing device.

I’ve taken that understanding into my other area of research—multi-core computing—to try to come up with an understanding of where we should be going with the design of computer hardware for machine learning, what things should look like in the future, based on learnings from my work in connectomics.

Ben: There are people in AI who think that, while it’s interesting to learn how the brain works, they say, “We’re only interested in getting results.” What do you say to people who say that understanding the intricacies of the brain is only useful to the extent that it helps us get better AI systems?

Nir: Well, first of all, I’m with them. That’s fine. That’s a valid goal. I think the first learning we can get from understanding how the brain works is understanding that we don’t really need to build these kinds of massive, tens of petaflop devices to actually solve the problems our brain solves. Our brain is an extremely sparse computing device. Essentially, if you do the calculation, your brain, your cortex does about the same compute as a cell phone. But it does it on a very large graph, a graph that is petabyte size. In the hardware devices we build, the compute is a petaflop of compute on a cell phone’s worth of memory: 16 gigs, 32 gigs. If we want to mimic our brain, we should be building something that is like a cell phone of compute on a petabyte of memory, not a petaflop of compute on a cell phone’s worth of memory.

This is the understanding I’ve had from this research: that basically we are solving the problem the wrong way. And there is a reason for it: we don’t know the algorithm. We really don’t know what the graph looks like. Typically in computer science, when you don’t know what the problem is, you end up throwing a lot of compute at it. That’s the stage machine learning is really in right now. But it’s a temporary time. So, in this temporary period, we’re building these accelerators (small memory, large compute) because we don’t know how to do the algorithms, but the algorithms are improving every year.

This whole thing has been going on since about 2012. We’re seven years in, and in seven years we have EfficientNets from Google. They are something the size of ResNet-50 doing what something the size of AmoebaNet does, like MobileNet doing what ResNet does. We’re learning how to make the amount of compute go down by a lot, but we’re still in the business of running on big datasets. We don’t want to do 224 by 224 images; we want to do 30 frames of 4K video. To do that, I need a lot of memory.

How do I do it with today’s hardware devices? We’re told, “Okay, take a bunch of Nvidia GPUs, connect them with a Mellanox pipe, and now you can solve this problem.” Instead, maybe we can just do this on a CPU. You can get a terabyte of memory on a desktop today, so there’s no memory limitation. I can feed in whatever I want. The reason we’re not doing that is we’ve bought the Kool-Aid on the idea that you need the amount of computation that exists in GPUs and TPUs to solve this problem. My conclusion is this is really not true, or in most cases it’s not true.
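Editor’s note: the contrast Nir draws between 224 by 224 images and 30 frames of 4K video can be made concrete with a rough size calculation. The sketch below is illustrative only. It assumes uncompressed 8-bit RGB (3 bytes per pixel) and takes “4K” to mean 3840 by 2160; neither figure comes from the interview, and it counts raw input only, not the intermediate activations a network would also have to hold.

```python
# Editor's sketch: raw input sizes, assuming 8-bit RGB (3 bytes per pixel)
# and "4K" meaning 3840 x 2160 (UHD). Neither assumption is from the interview.

BYTES_PER_PIXEL = 3  # assumed: uncompressed 8-bit RGB


def frame_bytes(width: int, height: int) -> int:
    """Uncompressed size of one frame in bytes."""
    return width * height * BYTES_PER_PIXEL


small_image = frame_bytes(224, 224)        # the 224 by 224 input Nir mentions
video_clip = 30 * frame_bytes(3840, 2160)  # 30 frames at the assumed 4K size

print(f"224 x 224 image: {small_image:,} bytes")
print(f"30 frames of 4K: {video_clip:,} bytes")
print(f"ratio: {video_clip / small_image:,.0f}x")
```

Under these assumptions the video input is on the order of thousands of times larger than a single small image, which is the sense in which input size, rather than arithmetic alone, drives the memory requirement.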

Ben: So to clarify, I think the amount of compute that’s needed continues to grow largely because of deep learning. In fact, there are now specialized hardware startups like Cerebras and Graphcore that optimize for the deep learning workload. So, what are you proposing?

Nir: Maybe right now, for a short time, those accelerators are still the way to go. But I’m claiming you can get the same kinds of speeds on a CPU, and you don’t get all the problems that accelerators introduce.

Ben: We’re still talking about deep learning, right?

Nir: Yes, but I’m saying in deep learning we are improving the actual algorithms, the networks we use. We started way back when with AlexNet and then went to ResNet, and for a while the networks were getting bigger and bigger and bigger, like AmoebaNet and so on. Now we can get the same accuracy with a much smaller network in terms of compute. This is a trend. So, it’s not clear to me what the advantage of having special hardware accelerators at this point in time is.

Ben: Here’s my interpretation of what the folks at Cerebras might say: we’re in a highly empirical era for machine learning. One of the bottlenecks is training time, which means, because we’re in this experimental trial and error phase, machine learning specialists can explore less of the model space because each training run could take two weeks. So, they’re trying to accelerate training time so you can try many architectures.

Nir: They’re doing that based on the assumption that there is one algorithm you can run. Let’s say I’m training ResNet-50, for example. Right now, training ResNet-50 takes about 90 epochs of training with ImageNet to get the accuracy I need. This is with stochastic gradient descent using current algorithms. I’m claiming new algorithms that use a lot less compute are coming down the road—even at NeurIPS coming up, I think there will be a bunch of papers on that. I agree with them by the way, that defining and finding the models and so on is the right thing. And it could be for a while that accelerators are needed, but the algorithms, even the ones for actually finding the networks, are going to evolve and change. In a field that is about seven years old, it is very risky to build hardware because it kind of nails you to an algorithm.

Ben: OK—here’s my interpretation of what the specialized hardware folks at Cerebras would say: we’re now in year seven, which means if you look at hardware in general around this time, you know the workload, the fundamental computational tasks you have to do. So, it is the right time to build specialized hardware, which means optimizing for linear sparse matrix algebra.

Nir: Fair enough. Let me give you some examples from the real world. Once upon a time, during the last AI wave some time in the 90s, people said, “Lisp is a great language.” But they actually invented the Lisp Machine because they thought you couldn’t actually do CAR and CDR together at the same time with current hardware, so they built these machines for running Lisp. They lasted for a very short amount of time until software overtook them. The same thing happened with networking, where you would have all these accelerators, and it’s all gone; it’s all CPU based now. This phenomenon of throwing a lot of compute in the beginning and then later it being overtaken by software is not new. So I do agree that for a while now, there might be room for accelerators as we learn what the algorithms are.

I’m claiming, though, from what I see, CPUs are not very far behind. This is really my big thing. I don’t know Cerebras’ technology well, but accelerators in general have the problem that they’re fixed in hardware and therefore are kind of limiting in what we can actually do to get speed. If we could get the same speed on a CPU, there wouldn’t be a reason for that. I’m claiming that you can actually get those speeds, or very close to them, and algorithms will get us there just by evolving.

Ben: So, part of your thesis, then, is that as CPU people are working on better hardware, people doing models are improving algorithms?

Nir: Yes. So, I’m saying that accelerators have used the chip area very well, and they’ve got some acceleration load to the CPU and they throw in the branching and dense compute, great. Now we have these accelerators. Now everybody’s doing Moore’s law. From that point on, everybody’s bound by Moore’s law, so they have an advantage. But I’m claiming that advantage will be overcome in software in the next few years.

The reason I believe this is true, again, is this thing I said originally about the brain: if the brain was a massive compute device, then I would think to myself, “There’s no hope. I’ve got to have massive compute if I want to …” But I know that it’s like a cell phone of compute, and because it’s so low in computation, there is hope that I will find that algorithm. It doesn’t even have to be the same algorithm your brain has. It just has to be an algorithm that mimics it. And since it’s an algorithm, and the CPU is a Turing machine, I should be able to implement it, more efficient or less efficient, but I will be able to implement it.

Ben: Let’s turn to your personal journey on this topic. Before you arrived at this conclusion, did you consider specialized hardware?

Nir: The way I arrived at this is, I had to build a pipeline for doing connectomics running on two petabytes of data at a terabyte an hour rate with a machine learning algorithm. So I started out with a CPU, thinking to myself, “I’ll develop the algorithm and then I’ll get a farm of GPUs to run it on.” What I found was that I was running on the CPU at GPU speeds. At the time, it was the Pascal GPU. That’s what happened, basically. I ended up thinking, “Oh wow, you can do it. So if you can run on a CPU at GPU speeds, why not for every algorithm?” Why just for the thing I was doing for neurobiology?
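Editor’s note: the two figures Nir quotes here, two petabytes of data and a terabyte an hour, imply a total processing time that is easy to work out. The sketch below is a back-of-the-envelope calculation, assuming decimal units (1 PB = 1,000 TB) and a sustained, uninterrupted rate; both are simplifying assumptions rather than details from the interview.

```python
# Editor's sketch: how long two petabytes take at one terabyte per hour.
# Assumes decimal units (1 PB = 1,000 TB) and a sustained, uninterrupted rate.

DATASET_TB = 2 * 1000   # two petabytes, expressed in terabytes
RATE_TB_PER_HOUR = 1    # throughput quoted in the interview

hours = DATASET_TB / RATE_TB_PER_HOUR
days = hours / 24

print(f"{hours:,.0f} hours, or about {days:.0f} days of continuous processing")
```

Even at that rate, a single pass over the dataset runs for months, which helps explain why per-machine throughput mattered so much for this pipeline.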

Ben: What is the status of this line of thinking at the moment?

Nir: Well, my company Neural Magic has a product. It’s a runtime that, right now, is an inference product. It takes an Open Neural Network Exchange (ONNX) description of the neural networks and it runs it on a commodity Intel CPU, and it runs it at GPU speeds. So, we run the EfficientNets faster than the Volta GPU, which is Nvidia’s top GPU, on a commodity Intel hardware offering. It offers a better price point, but I don’t think that’s the main thing. The main thing is that it offers the potential of unlimited memory. I mean, I can do really big things. For many algorithms, we’re limited by the size of things that can fit into the GPU, and this is what a CPU really unlocks. So even if it doesn’t deliver twice the performance of the GPU, but just matches it or is close to it, it offers that with a lot of memory. And that’s where the advantage is.

Ben: What about training?

Nir: The same thing is going to be true for training. Right now, new algorithms for training are coming up. Again, going back to the ResNet example, we know it seems like most important parameters stabilize after about four or five epochs. I don’t want to be completely quoted on this—this is new research. It’s not clear exactly what it is, but let’s say not a lot of epochs and I’ve kind of converged, and now I can run sparse, and if I can run sparse, I can go down 10X in the amount of compute I need. So, the whole training process can be cut by a huge factor. And this is just the beginning of it. I’m sure over time people will come up with better and better algorithms to reduce the amount of compute you actually need.

But as long as we’re doing neural networks, the size of the network is going to stay large because we have to fit it to the size of the dataset we’re using—they grow with the size of the dataset. So that’s probably going to stay the same for a while until we change; maybe we’ll go away from stochastic gradient descent to another algorithm.
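Editor’s note: Nir’s point that running sparse can cut compute by roughly 10X can be illustrated by counting operations. The sketch below uses simple magnitude pruning on one randomly initialized layer; the layer size, the 90% sparsity level, and the pruning rule are all illustrative assumptions, not the training method he alludes to and not a description of Neural Magic’s software.

```python
import random

# Editor's sketch: magnitude pruning of one dense layer, then counting the
# multiply-adds that remain. The layer size, the 90% sparsity level, and the
# pruning rule are illustrative assumptions, not details from the interview.

random.seed(0)
ROWS, COLS = 64, 128
SPARSITY = 0.9  # fraction of weights to zero out (assumed)

weights = [[random.gauss(0.0, 1.0) for _ in range(COLS)] for _ in range(ROWS)]


def magnitude_prune(matrix, sparsity):
    """Return a copy with the smallest-magnitude `sparsity` fraction set to 0."""
    flat = sorted(abs(w) for row in matrix for w in row)
    n_drop = int(len(flat) * sparsity)
    if n_drop == 0:
        return [row[:] for row in matrix]
    threshold = flat[n_drop - 1]  # largest magnitude that gets dropped
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def multiply_adds(matrix):
    """A matrix-vector product needs one multiply-add per nonzero weight."""
    return sum(1 for row in matrix for w in row if w != 0.0)


dense_ops = multiply_adds(weights)
sparse_ops = multiply_adds(magnitude_prune(weights, SPARSITY))
print(f"dense: {dense_ops}, sparse: {sparse_ops}, "
      f"reduction: {dense_ops / sparse_ops:.1f}x")
```

The count only shows the arithmetic that could be skipped. Turning it into a real speedup depends on a runtime that avoids work on the zeroed weights, and whether accuracy survives that level of pruning is a separate empirical question.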

Ben: What’s the timeline on the training side, if you were to speculate?

Nir: People are working on it right now. I think it will be a couple of years before things improve significantly without special hardware, but just on existing hardware, we’re going to be able to do things much, much faster. It’s a necessity because the way the amount of compute we’re using is doubling all the time is not sustainable. We’re going to saturate, and from there we’re just going to be improving the algorithms.

If you think about it, when we think about Moore’s law and say, “We need to overcome Moore’s law, so we need special hardware.” Well, the same is true for software. We just need to specialize the software more to overcome the problems. This is where I’m coming from.

Ben: What about FPGAs?

Nir: Again, they have the same memory limitation problem. You’re talking about small things, and I agree, FPGAs, yes you can do that, but if a commodity Intel processor will do it, why do I need to go to special things? Think of the advantages of doing things with commodity hardware. Machine learning is now requiring you to have all these special things that don’t fit with containerized virtualized software. They require you to allocate, pre-allocate, pieces of hardware real estate. Think if you’re a cloud provider, and you have to allocate, let’s say, a ton of Cerebras devices; they’re going to be used for one thing. Whereas at the same time, you might have 90% more in CPUs that are not used all the time. You could use those. So, from a business point of view, it’s not clear to me what the win is in having these accelerators, even though I do understand that bringing flops to the problem in the current state of algorithms has value.

Ben: So how mainstream is this line of thinking?

Nir: It’s not.

Ben: That’s where the disruption comes in, right?

Nir: That’s right. Neural Magic is actually saying things other people don’t believe. Everybody is believing we need hardware accelerators. Right?

Ben: But you’ve pointed out that the pattern of accelerators is disproven in other areas.

Nir: Right, but people never learn, do they? This is why we have it repeating every time. And to some extent, it came at the right time: the arrival of these accelerators together with the so-called death of Moore’s law was a really good moment in time. It’s just that the area of machine learning is a little bit too young to commit to hardware right now. It could very well be that there is a hardware thing we will need down the road, but I don’t think we know what it is right now.

Ben: Sparse linear algebra.

Nir: It could be many, many things.

Ben: The fact that the training times are long is what’s driving a lot of the investment then, right?

Nir: If I understand correctly, most money spent on ML right now is money spent on inference.

Ben: So, inference is definitely going to be a bigger business.

Nir: Right, because you train the model. If you notice what’s happened with the training, a lot of time is being spent by a small number of companies developing the models, but most people who use them take off-the-shelf products. People take ResNet-50.

Ben: They get the representations, then train the last layer?

Nir: Yes, exactly. So it’s really the case that, yes, there’s a lot of training, but you build it, you design a network, you start running it, and you’re running it a lot more than you ran it when you were doing the training.

Ben: A hardware startup that tackles inference is a bigger business.

Nir: No, I think a hardware startup that tackles inference is going to run into all the business problems that I mentioned, and, I don’t know if this is going to be true or not, but I think they’ll have to prove the economic case for hardware acceleration when there are all these CPUs sitting around. Think of yourself as an organization. You want to start doing machine learning. You go to your boss and say, “Oh listen, I found out that I can do machine learning for this.” He says, “OK, great. What do you need to do it?” You’ll say, “I need to order a GPU from Nvidia, then I’ll install and prepare that, and I’ll run it.” He says, “No, no, no. Just rent something on Amazon.” So you rent a GPU on Amazon and run there. Okay? This is the current model. But if you could just download software, install, and run on your desktop and do this, why would you rent a GPU in the cloud? This is the future I see. I see a software future where you just download, install, and run. And there’s nothing to tell us this is not the case.

Ben: Actually, what you’re describing reminds me a lot of what some friends in China are describing, and this is true even here, it turns out the cloud providers have a lot more CPUs than GPUs, right?

Nir: Yes.

Ben: So tell me a little more about Neural Magic.

Nir: The product right now is an inference runtime. It takes an ONNX model of networks and it runs them at competitive speeds relative to GPUs. I haven’t compared them to other accelerators. I use the Volta as a measure of what can be done in terms of these speeds, and we’re competitive.

Ben: You talked a lot about examples from computer vision. What about other data types like speech and text? Are you folks looking into that?

Nir: We will down the road. We’re a very small startup right now, but it’s on our roadmap. As you go to speech and to reinforcement learning, the CPU becomes even more of a valuable resource because things are even more memory bound than they are when you’re doing neural network computation. So, image classification, all the image processing stuff is more compute intensive and less memory bound than, for example, when you do speech.

Ben: So everything you’ve talked about here is not tied to certain types of architecture. Like you can only do convolutions but not LSTMs.

Nir: Yes. I expect the things with reinforcement learning or recurrent neural networks (RNNs) and things that have state will be easier for CPUs to do.

Ben: For reinforcement learning, the neural networks are not necessarily big architectures, but there’s a lot of simulation.

Nir: Right. Again, it’s not clear to me that a CPU cannot handle this workload. So, to be honest, to be fair, Neural Magic has not addressed this problem yet. I’m very comfortable saying that we know we can run neural networks for image classification at GPU speeds. I’m not comfortable saying that about other areas, but I expect to see good results.

Ben: In terms of hardware, it turns out that a lot of hardware vendors also invest a lot in software, right? Nvidia and CUDA and even Cerebras are investing in software to make sure TensorFlow and PyTorch run fast on their hardware. So, you’re all about software, right?

Nir: Right. So, every hardware device requires a huge software stack on top of it. In particular, it requires the specific functionality.

Ben: You can have better hardware than me, but if I have better libraries…

Nir: That’s right. This is part of the disadvantage that a lot of companies have relative to Nvidia, which has built a beautiful kind of software ecosystem.

Ben: Years of investment.

Nir: Right. And also on CPUs; we have years of investment in the software ecosystem. This is where Neural Magic is playing. There’s so much of an ecosystem for software on CPUs in particular, even for companies—containerized software, virtualized software. And in machine learning, there’s a lot of support and people are used to doing things. So rather than coming up with a new accelerator that requires its own software stack and its own way of running things and all the peculiarities of that hardware, we know how CPUs work and we know how GPUs work; it’s going to be hard for other competitors to come in there and unseat these existing well-established software frameworks.

Ben: Let’s close by having you describe your plan to convert the rest of the world to your line of thinking.

Nir: Well, we’re now running pilots with a bunch of companies, and as the product stabilizes, we’ll start to offer it to universities; we’ll offer it to companies to try out. If somebody can just download software, install, and run, then they’ll see the value of not having to deal with specialized hardware.

Ben: By the way, the other thing that people who don’t follow this space closely don’t realize is, the hardware startups, the capital investment is immense, right?

Nir: Right.

Ben: So you’re making a big bet. Then there’s the manufacturing risks. Then the software libraries…

Nir: Exactly. This is why it’s going to be very, very hard for accelerators that are not supported by companies like Google, Facebook, or Amazon, to make a play in the hardware space.

Ben: Economies of scale.

Nir: Right. But we can offer something that these accelerators don’t provide, which is that you can use all that hardware you have lying around. Use it to get comparable performance. If I get a Volta to run a ResNet-50 at a certain speed, if for the same price I can take 10 CPUs and do the same thing, even that is a valuable commodity for the cloud provider, because those CPUs are sitting idle.

Ben: Exactly. The cloud providers have so much CPU, right?

Nir: Also, if you think of companies like banks, rather than install mass farms of hardware accelerators, they can just use their existing, very powerful CPU-based computing infrastructure to do the same things. In a typical program, you have a program that has all kinds of components, and machine learning is typically part of that. It’s not just a standalone piece. So, if you’re running your machine learning software as part of a big application, rather than having to have an accelerator there, you can continue to run the whole thing in the exact device that you ran. This is a big advantage.

Ben: Of course, the vision of the accelerator and GPU company says, “No, run the entire pipeline on our system,” or “you will port all of the things you need over to our system.” Right?

Nir: Of course, yes.

Ben: But then, as you point out, the ecosystem of tools for CPUs spans much more than the model building and model inference, right?

Nir: That’s right. Exactly.

Ben: And people already know how to use those tools.

Nir: Yes, exactly.

Ben: I’m a firm believer in, if possible, let people use the tools they’re familiar with.

Nir: The way I view the Neural Magic play, I know it’s a contrarian play right now, but the way I see it is, we’re trying to bring machine learning back on track. We’re trying to say, “Look, maybe there are edge cases when you need an accelerator, but for typical things that people do, it’s good enough to run on a CPU with Neural Magic software or somebody else’s. I’m sure there will be other companies trying to do the same thing. So, you run on the CPU and you don’t incur all the pain that would come from doing it with an accelerator.” This is a totally valid play in the software space.

Ben: Honestly, we need different approaches and different ideas because we’re still very early in all of these things.

Nir: That’s right. Very early in the game. We really don’t understand that much about ML and deep learning the way it’s done today.

Ben: That was great. Thank you very much, Nir.

Nir: Thank you.