Accelerating Machine Learning Inference on CPU with VMware vSphere and Neural Magic


This blog was originally posted by Na Zhang on VMware’s Office of the CTO Blog. You can see the original copy here.

Increasingly large deep learning (DL) models require a significant amount of computing, memory, and energy, all of which become a bottleneck in real-time inference where resources are limited. In this post, we detail our work collaborating with Neural Magic to demonstrate accelerated machine learning (ML) inference on commodity hardware (CPU) through two techniques: model compilation optimization and algorithmic neural network pruning/sparsification. Model compilation optimization is a post-training step that adapts the model artifact and model execution to better align with the underlying hardware. Pruning/sparsification reduces DL computation by trimming low-ranking neurons and removing their connections. It can efficiently produce smaller and faster models with minimal loss in accuracy.
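To make the pruning idea concrete, here is a minimal sketch of unstructured magnitude pruning on a single weight matrix. It is an illustration only, not Neural Magic's method: ranking weights by absolute value, the `magnitude_prune` helper, and the 90% sparsity target are all assumptions chosen for the example. In practice, pruning is applied gradually and combined with retraining so that accuracy is preserved.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    Hypothetical helper: "low-ranking" is approximated here by absolute value.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of weights to remove
    if k == 0:
        return weights.copy()
    magnitudes = np.abs(weights)
    # Magnitude of the k-th smallest weight; entries at or below it are cut.
    threshold = np.partition(magnitudes.ravel(), k - 1)[k - 1]
    return np.where(magnitudes > threshold, weights, 0.0)


rng = np.random.default_rng(0)
layer = rng.normal(size=(64, 128))          # stand-in for one dense layer
pruned = magnitude_prune(layer, sparsity=0.9)
achieved = 1.0 - np.count_nonzero(pruned) / pruned.size
print(f"achieved sparsity: {achieved:.2%}")
```

A sparse matrix like `pruned` only speeds up inference when the runtime can skip the zeroed weights, which is why pruning and an optimized inference engine are evaluated together in this post.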

Below, we present an overview of Neural Magic, our performance evaluation results on vSphere 7.0 versus bare metal, and a performance comparison with other leading industry inference techniques, including ONNX Runtime and OpenVINO.

Neural Magic

Neural Magic is a software solution for DL inference acceleration that enables companies to use CPU resources to achieve ML performance breakthroughs at scale. It provides a suite of tools to select, build, and run performant DL models on commodity CPU resources, including:

  • Neural Magic Inference Engine (NMIE) runtime, which delivers highly optimized performance for DL applications running on x86 CPUs.
  • Neural Magic ML Tooling, which provides pruning and sparsification libraries to simplify recalibration efforts to maximize performance and accuracy on any model. These pruning and sparsification libraries work with mainstream DL frameworks, such as TensorFlow and PyTorch.
  • A set of pre-trained and pre-optimized models for customers to enable quick deployment using their own data. The pre-trained models are typical computer-vision models, such as image classification and object detection. 



In ML, inference speed measures how fast a system can process input and produce results using a trained model. For example, in our benchmarking, it refers to the sequence of taking an image of size 224×224×3 (224×224 pixels, 3 RGB channels), feeding it to a neural network, and returning a classification result produced by the model. Every test case was evaluated with two neural network models (ResNet-50 and VGG-16), different numbers of CPU cores (4, 10, 20), and two batch sizes (1, 64). The two models were pretrained with the ImageNet dataset and have baseline validation top-1 accuracies of 76.10% and 71.59%, respectively. The hardware and software details appear below.
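The measurement described above can be sketched as a small timing harness. This is a simplified illustration, not the benchmark code used for the results in this post: `fake_classifier` is a placeholder model (a single dense layer with 10 made-up classes), `images_per_second` is a hypothetical helper, and the iteration and warm-up counts are arbitrary. It does not control the number of CPU threads.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
# Placeholder weights: 224*224*3 input features mapped to 10 made-up classes.
WEIGHTS = rng.normal(size=(224 * 224 * 3, 10)).astype(np.float32)


def fake_classifier(batch: np.ndarray) -> np.ndarray:
    """Stand-in for a real network: one dense layer, then argmax per image."""
    flat = batch.reshape(batch.shape[0], -1)
    return (flat @ WEIGHTS).argmax(axis=1)


def images_per_second(model, batch_size: int, iterations: int = 20,
                      warmup: int = 3) -> float:
    """Time repeated inference on one random batch and return throughput."""
    batch = rng.random((batch_size, 224, 224, 3), dtype=np.float32)
    for _ in range(warmup):          # warm-up runs are excluded from timing
        model(batch)
    start = time.perf_counter()
    for _ in range(iterations):
        model(batch)
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed


for bs in (1, 64):                   # latency- and throughput-oriented cases
    print(f"bs={bs}: {images_per_second(fake_classifier, bs):.1f} images/s")
```

Swapping `fake_classifier` for a call into a real inference runtime gives the images-per-second figure plotted on the y-axis of the charts in this post.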

Hardware and Software for Testing

Host: PowerEdge R740, dual socket, 10 cores per socket, 40 logical processors (Hyperthreading on)
Processor Type: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
ESXi: ESXi 7.0.0, build number 15843807
OS: Ubuntu 18.04
NVIDIA driver: 440.33.01
CUDA Toolkit/cuDNN: 10.2/8
Neural Magic: v1.1.0
ONNX Runtime: 1.3.0

Test 1: Neural Magic Results on vSphere

First, we examined how Neural Magic performs on vSphere. As illustrated in Figure 1, there are three pretrained model types available in the Neural Magic model repository: Base, Recal, and Recal-Perf (cases 1, 2, and 3). Base is the baseline model obtained from the standard training process. Recal is a model that has been recalibrated for performance optimization on the NMIE while maintaining ~100% of baseline accuracy. Recal-Perf has been recalibrated more aggressively for performance on the NMIE while maintaining ~99% of baseline accuracy.

Figure 1: Illustration of the flow with Neural Magic Inference Engine with different model types  

The performance results for ResNet-50 and VGG-16 are shown in Figures 2 and 3. In the figures, the x-axis represents test cases with different numbers of threads (one thread per core), at batch size 1 (latency-oriented inference) or batch size 64 (throughput-oriented inference). The y-axis is throughput in images per second (higher is better). Comparing NeuralMagic-Base, NeuralMagic-Recal, and NeuralMagic-RecalPerf shows significant performance gains from pruning and recalibration. In other words, pruning unnecessary (low-ranking) neurons boosts performance with essentially no accuracy loss. Further, if one is willing to sacrifice ~1% accuracy, the performance benefits are larger, as shown in the ResNet-50 results. However, VGG-16 Recal-Perf performs slightly worse than Recal. Neural Magic has since improved its pruning technology to address this, and the updated version is expected to make VGG-16 Recal-Perf faster than Recal.

Figure 2: Inference speed for classification task with ResNet-50 model (bs = batch size)
Figure 3: Inference speed for classification task with VGG-16 model (bs = batch size)
Test 2: Virtual (vSphere) vs. Bare Metal

Next, we examined how virtual performance compares with bare-metal performance running Neural Magic. We set up the testbeds shown in Figure 4 on the same host with 20 cores (40 logical processors with hyperthreading enabled) and 192 GB memory. Generally, the VM should be sized according to the requirements of the workload. Our benchmark is a CPU-heavy, multi-threaded application that is capable of using all available cores. Consequently, we utilized one large VM per host in our virtual testbed. Though hyperthreading is enabled, the VM is configured with 20 vCPUs to match the number of physical CPU cores. The extra logical cores are left for use by ESXi hypervisor helper threads. This is standard practice for performance-critical high-performance computing (HPC) and ML workloads.
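The sizing rule described above is simple arithmetic, sketched below. The `Host` class and `size_single_vm` helper are hypothetical names introduced for illustration; the socket and core counts come from the test host described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    sockets: int
    cores_per_socket: int
    threads_per_core: int = 2  # 2 when hyperthreading is enabled

    @property
    def physical_cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def logical_processors(self) -> int:
        return self.physical_cores * self.threads_per_core


def size_single_vm(host: Host) -> dict:
    """One large VM per host: vCPUs match physical cores, not logical ones."""
    vcpus = host.physical_cores
    return {
        "vcpus": vcpus,
        "logical_left_for_hypervisor": host.logical_processors - vcpus,
    }


r740 = Host(sockets=2, cores_per_socket=10)
print(size_single_vm(r740))
# {'vcpus': 20, 'logical_left_for_hypervisor': 20}
```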

Figure 4: Testbed Configuration

Figures 5 and 6 show the virtual versus bare-metal performance ratios for the test cases in the previous section. As can be seen from the results, most test cases show small degradations, normally within 5%, with a maximum of 7%.
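For readers who want to reproduce this comparison on their own measurements, the ratio and the degradation percentage can be computed as in the sketch below. The helper names are hypothetical, and the throughput numbers are made-up placeholders, not results from our tests.

```python
def virtual_to_bare_metal_ratio(virtual_ips: float, bare_metal_ips: float) -> float:
    """Ratio of virtual to bare-metal throughput; 1.0 means identical."""
    if bare_metal_ips <= 0:
        raise ValueError("bare-metal throughput must be positive")
    return virtual_ips / bare_metal_ips


def degradation_percent(ratio: float) -> float:
    """Virtualization overhead as a percentage (negative if virtual is faster)."""
    return (1.0 - ratio) * 100.0


# Placeholder (virtual, bare-metal) images/s pairs, for illustration only.
samples = {
    "bs=1, 4 threads": (95.0, 100.0),
    "bs=64, 20 threads": (186.0, 200.0),
}

for case, (virtual_ips, bare_ips) in samples.items():
    ratio = virtual_to_bare_metal_ratio(virtual_ips, bare_ips)
    print(f"{case}: ratio={ratio:.2f}, degradation={degradation_percent(ratio):.1f}%")
```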

Figure 5: Virtual vs. bare-metal performance ratios for the ResNet-50 model. Higher translates to better virtual performance, with 1.0 meaning that virtual performance is identical to bare metal.
Figure 6: Virtual vs. bare-metal performance ratios for the VGG-16 model. Higher means better virtual performance, with 1.0 meaning that virtual performance is identical to bare metal.
Test 3: Neural Magic vs. OpenVINO vs. ONNX Runtime vs. ONNX Runtime with GPU

In the previous two sections, we presented Neural Magic performance results with different model types on vSphere and compared virtual vs. bare-metal performance. In this section, we benchmark Neural Magic against other available inference frameworks (ONNX Runtime and OpenVINO), all running in the virtual environment. Figure 7 illustrates the inference flow for the different test cases. Note that one of the core components of the OpenVINO toolkit is the Model Optimizer, which converts a trained neural network in .onnx format to an intermediate representation (IR) for use in inference. The IR model format (.xml and .bin) was used for OpenVINO testing (case 4) for optimized performance, as recommended by OpenVINO.

Figure 7: Illustration of the flow for inference with different test cases. Orange circles represent different test cases.

The performance results for ResNet-50 and VGG-16 are shown in Figures 8 and 9. 

From the results, we can see the following:

  • Comparing NeuralMagic-Base (case 1), OpenVINO-Base (case 4), and ONNXRuntime-Base (case 5), the three runtimes deliver the same level of performance for the ResNet-50 model, while NeuralMagic-Base’s performance is significantly better than the other two inference solutions for the VGG-16 model. The reason VGG-16 performs significantly better with the Neural Magic Inference Engine comes down to how the network's layers are connected and what they consist of. VGG-16 is almost entirely 3×3 convolutions with max pooling, which allows Neural Magic to apply a state-of-the-art convolution algorithm, as well as memory optimizations on the pooling operations. The ResNet-50 architecture, on the other hand, relies heavily on 1×1 convolutions and has very few max pooling operations, so these optimizations have less to work with.
  • Comparing ONNXRuntime-Base (case 5) and ONNXRuntimeGPU-Base (case 6), GPU is much faster than CPU, as expected. For example, for the ResNet-50 model, ONNX Runtime with one NVIDIA T4 GPU is 9.4x and 14.7x faster than CPU with four cores for batch size 1 and batch size 64, respectively.
  • When scaling to 20 CPU cores, NeuralMagic-RecalPerf (case 3) even outperforms ONNXRuntimeGPU-Base (case 6) with an NVIDIA T4 GPU for the ResNet-50 model with batch size 64. In other words, by combining runtime optimization and pruning, it is possible to get GPU-level performance on CPU with Neural Magic.
Figure 8: Inference speed for classification task with ResNet-50 model
Figure 9: Inference speed for classification task with VGG-16 model


For ML inference, the choice between CPU, GPU, or other accelerators depends on many factors, such as resource constraints, application requirements, deployment complexity, and economic cost. In our tests, we showcased the use of CPU to achieve fast inference on vSphere through our partnership with Neural Magic. Our experimental results demonstrate small virtualization overheads: normally within 5% of bare metal, with a maximum of 7%. The results also show that significant performance boosts can be achieved with Neural Magic’s techniques: model runtime optimization and network pruning.

VMware is committed to helping customers run all their ML workloads on vSphere, from the edge to the datacenter to the cloud, for both training and inference. While we continue to strongly support adoption of different hardware accelerators in private and hybrid cloud environments to accelerate compute-intensive workloads, VMware also collaborates with partners to encourage innovation and facilitate transformation in all aspects of ML.