Feb 25, 2021
This blog was originally posted by Na Zhang on VMware's Office of the CTO Blog. You can see the original copy here.
Increasingly large deep learning (DL) models require a significant amount of computing, memory, and energy, all of which become a bottleneck in real-time inference where resources are limited. In this post, we detail our work collaborating with Neural Magic to demonstrate accelerated machine learning (ML) inference on commodity hardware (CPU) through two innovative techniques: model compilation optimization and algorithmic neural network pruning/sparsification. Model compilation optimization is a post-training step that adapts the model artifact and model execution to better align with the underlying hardware. Pruning/sparsification is a trending technique that reduces DL computation by trimming low-ranking neurons and removing connections. It can efficiently produce smaller and faster models with minimal loss in accuracy.
Below, we present an overview of Neural Magic, our performance evaluation results on vSphere 7.0 versus bare metal, and a performance comparison with other leading industry inference techniques, including ONNX Runtime and OpenVINO.
Neural Magic
Neural Magic is a software solution for DL inference acceleration that enables companies to use CPU resources to achieve ML performance breakthroughs at scale. It provides a suite of tools to select, build, and run performant DL models on commodity CPU resources, including:
- Neural Magic Inference Engine (NMIE) runtime, which delivers highly optimized performance for DL applications running on x86 CPUs.
- Neural Magic ML Tooling, which provides pruning and sparsification libraries that simplify the recalibration effort needed to maximize performance and accuracy on any model. These libraries work with mainstream DL frameworks such as TensorFlow and PyTorch (a minimal pruning sketch follows this list).
- A set of pre-trained and pre-optimized models that customers can quickly deploy using their own data. The pre-trained models cover typical computer-vision tasks, such as image classification and object detection.
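To make the pruning/sparsification idea concrete, here is a minimal sketch of magnitude-based pruning on a pre-trained ResNet-50 using PyTorch's built-in torch.nn.utils.prune utilities. This illustrates only the general technique, not Neural Magic's recalibration workflow, and the 80% sparsity target is an arbitrary example value.

```python
import torch
import torch.nn.utils.prune as prune
from torchvision.models import resnet50

# Illustrative one-shot magnitude pruning (not Neural Magic's tooling).
model = resnet50(pretrained=True)
model.eval()

# Zero out 80% of the weights in every convolution layer by L1 magnitude.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")  # make the zeroed weights permanent

# Report the resulting sparsity of the convolution weights.
total = zeros = 0
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        total += module.weight.nelement()
        zeros += int((module.weight == 0).sum())
print(f"Conv weight sparsity: {zeros / total:.1%}")
```

In practice, pruning is interleaved with fine-tuning (recalibration) so accuracy is recovered as sparsity increases; one-shot pruning like the sketch above typically loses accuracy.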
Performance
Benchmark
In ML, inference speed measures how fast a system can process input and produce results using a trained model. For example, in our benchmarking, it refers to the sequence of taking an image of size 224x224x3 (224×224 pixels, RGB 3 channels), feeding it to a neural network, and returning a classification result produced by the model. Every test case was evaluated with two neural network models (ResNet-50 and VGG-16), different numbers of CPU cores (4, 10, 20), and two batch sizes (1 and 64). The two models were pretrained on the ImageNet dataset and have baseline Top-1 validation accuracies of 76.10% and 71.59%, respectively. The hardware and software details appear below.
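As a rough illustration of how such a benchmark can be timed, the sketch below measures images per second for an ONNX model with ONNX Runtime on CPU. The model path, warm-up and iteration counts, and the use of random input data are assumptions for illustration; the intra_op_num_threads setting corresponds to the per-test-case core counts (4, 10, or 20).

```python
import time
import numpy as np
import onnxruntime as ort

MODEL_PATH = "resnet50.onnx"   # hypothetical path to an exported model
BATCH_SIZE = 1                 # 1 = latency-oriented, 64 = throughput-oriented

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4  # one thread per CPU core used by the test case

sess = ort.InferenceSession(MODEL_PATH, sess_options=opts)
input_name = sess.get_inputs()[0].name

# Models exported from PyTorch typically expect NCHW input (3x224x224).
batch = np.random.rand(BATCH_SIZE, 3, 224, 224).astype(np.float32)

for _ in range(10):            # warm-up runs
    sess.run(None, {input_name: batch})

n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    sess.run(None, {input_name: batch})
elapsed = time.perf_counter() - start
print(f"{n_iters * BATCH_SIZE / elapsed:.1f} images/sec")
```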
Hardware and Software for Testing
| Component | Details |
| --- | --- |
| Host | PowerEdge R740, dual socket, 10 cores per socket, 40 logical processors (hyperthreading on) |
| Processor type | Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz |
| GPU | NVIDIA Tesla T4 16GB |
| ESXi | 7.0.0, build 15843807 |
| OS | Ubuntu 18.04 |
| NVIDIA driver | 440.33.01 |
| Docker | 19.03.6 |
| CUDA Toolkit/cuDNN | 10.2 / 8 |
| Neural Magic | v1.1.0 |
| ONNX Runtime | 1.3.0 |
| OpenVINO | 2020.4.287 |
Test 1: Neural Magic Results on vSphere
First, we examined how Neural Magic performs on vSphere. As illustrated in Figure 1, there are three pretrained model types available in the Neural Magic model repository: Base, Recal, and Recal-Perf (cases 1, 2, and 3). Base is the baseline model obtained from the standard training process. Recal has been recalibrated for performance optimization on the NMIE while maintaining ~100% of baseline accuracy. Recal-Perf has been recalibrated for performance optimization on the NMIE while maintaining ~99% of baseline accuracy.
The performance results for ResNet-50 and VGG-16 are shown in Figures 2 and 3. In the figures, the x-axis represents the different test cases using different numbers of threads (one thread per core), with batch size 1 (latency-oriented inference) or batch size 64 (throughput-oriented inference). The y-axis is images per second (higher is better). Comparing the results for NeuralMagic-Base, NeuralMagic-Recal, and NeuralMagic-RecalPerf, there are significant performance gains from pruning and recalibration. In other words, we can get significant performance boosts by pruning unnecessary (low-ranking) neurons with no accuracy loss. Further, if one is willing to sacrifice ~1% of accuracy, the performance benefits are even larger, as shown in the ResNet-50 results. However, VGG-16 Recal-Perf performs slightly worse than Recal. Neural Magic has since improved its pruning technology and expects Recal-Perf to outperform Recal for VGG-16 as well.
Test 2: Virtual (vSphere) vs. Bare Metal
Next, we examined how virtual performance compares with bare-metal performance running Neural Magic. We set up the testbeds shown in Figure 4 on the same host with 20 cores (40 logical processors with hyperthreading enabled) and 192 GB memory. Generally, the VM should be sized according to the requirements of the workload. Our benchmark is a CPU-heavy, multi-threaded application that is capable of using all available cores. Consequently, we utilized one large VM per host in our virtual testbed. Though hyperthreading is enabled, the VM is configured with 20 vCPUs to match the number of physical CPU cores. The extra logical cores are left for use by ESXi hypervisor helper threads. This is standard practice for performance-critical high-performance computing (HPC) and ML workloads.
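As a minimal sketch of this sizing practice, the snippet below (run inside the guest) checks how many vCPUs the benchmark can see and caps the usual CPU threading knobs at one thread per core before the inference runtime starts. The environment variables shown are the standard OpenMP/MKL controls and are an illustrative assumption, not the exact configuration used in our test harness.

```python
import os

# Number of vCPUs visible inside the guest (20 in our virtual testbed).
visible_cores = os.cpu_count()
print(f"vCPUs visible to the guest: {visible_cores}")

# Cap CPU-side threading at one thread per core before launching the
# inference runtime. These are standard OpenMP/MKL knobs; the exact
# settings used in our benchmark runs may differ.
threads = str(min(visible_cores, 20))
os.environ["OMP_NUM_THREADS"] = threads
os.environ["MKL_NUM_THREADS"] = threads
```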
Figures 5 and 6 show the virtual versus bare-metal performance ratios for the test cases in the previous section. As can be seen from the results, most test cases show small degradations, normally within 5%, with a maximum of 7%.
Test 3: Neural Magic vs. OpenVINO vs. ONNX Runtime vs. ONNX Runtime with GPU
In the previous two sections, we presented Neural Magic performance results with different model types on vSphere and compared virtual vs. bare-metal performance. In this section, we focus on benchmarking Neural Magic against other available inference frameworks (ONNX Runtime and OpenVINO), all running in the virtual environment. Figure 7 illustrates the inference flow for the different test cases. Note that one of the core components of the OpenVINO toolkit is the Model Optimizer, which can convert a trained neural network in .onnx format to an intermediate representation (IR) for use in inference. The IR model format (.xml and .bin) was used for OpenVINO testing (case 4) for optimized performance, as recommended by OpenVINO.
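A rough sketch of that flow with the OpenVINO 2020.x Python API is shown below: the Model Optimizer converts the ONNX model to IR (.xml/.bin), and the Inference Engine loads the IR on CPU. The paths and the Model Optimizer invocation are illustrative assumptions that depend on the OpenVINO installation layout.

```python
# Step 1 (command line, assumed OpenVINO 2020.x layout):
#   python mo.py --input_model resnet50.onnx --output_dir ir/
# This produces ir/resnet50.xml and ir/resnet50.bin.

import numpy as np
from openvino.inference_engine import IECore  # 2020.x Inference Engine API

ie = IECore()
net = ie.read_network(model="ir/resnet50.xml", weights="ir/resnet50.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_blob = next(iter(net.input_info))       # name of the model's input
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = exec_net.infer(inputs={input_blob: image})
```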
The performance results for ResNet-50 and VGG-16 are shown in Figures 8 and 9.
From the results, we can see the following:
- Comparing NeuralMagic-Base (case 1), OpenVINO-Base (case 4), and ONNXRuntime-Base (case 5): Neural Magic, OpenVINO, and ONNX Runtime deliver the same level of performance for the ResNet-50 model, while NeuralMagic-Base performs significantly better than the other two inference solutions for the VGG-16 model. The reason VGG-16 performs so much better on the Neural Magic Inference Engine comes down to how the neural networks are connected and what their components are. VGG-16 is almost entirely 3×3 convolutions with max pooling, which allows Neural Magic to apply a state-of-the-art algorithm as well as memory optimizations on the pooling operations. The ResNet-50 architecture, on the other hand, relies heavily on 1×1 convolutions and has essentially no max pooling operations, so it benefits less from these optimizations.
- Comparing ONNXRuntime-Base (case 5) and ONNXRuntimeGPU-Base (case 6), the GPU is much faster than the CPU, as expected. For example, for the ResNet-50 model, ONNX Runtime with one NVIDIA T4 GPU is 9.4x and 14.7x faster than the CPU with four cores for batch sizes 1 and 64, respectively (a minimal sketch of selecting the GPU execution provider follows this list).
- When scaling to 20 CPU cores, NeuralMagic-RecalPerf (case 3) is even better than ONNXRuntimeGPU-Base (case 6) with an NVIDIA T4 GPU for the ResNet-50 model with batch size 64. In other words, using both runtime optimization and pruning, it is possible to get GPU-level performance on CPUs with Neural Magic.
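For reference, here is a minimal sketch of selecting the CUDA execution provider with a recent onnxruntime-gpu build; with the 1.3.0 GPU package used in our tests, CUDA was selected by default, so the explicit providers argument is an assumption that applies to newer releases. The model path is hypothetical.

```python
import onnxruntime as ort

# Requires the onnxruntime-gpu package and a CUDA-capable device.
sess = ort.InferenceSession(
    "resnet50.onnx",  # hypothetical path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # shows which providers are actually active
```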
Summary
For ML inference, the choice between CPU, GPU, or other accelerators depends on many factors, such as resource constraints, application requirements, deployment complexity, and economic cost. In our tests, we showcased the use of CPUs to achieve ultra-fast inference speed on vSphere through our partnership with Neural Magic. Our experimental results demonstrate small virtualization overheads in most cases. The results also show that significant performance boosts can be achieved with Neural Magic's techniques: model runtime optimization and network pruning.
VMware is committed to helping customers run all their ML workloads on vSphere, from the edge to the datacenter to the cloud, for both training and inference. While we continue to strongly support the adoption of hardware accelerators in private and hybrid cloud environments for compute-intensive workloads, VMware also collaborates with partners to encourage innovation and facilitate transformation in all aspects of ML.