TL;DR: In addition to the general hyperparameters described in the previous post, the sparsity to target per layer is arguably the most critical hyperparameter you can set. Below we give you the reason why, and show you how.

Welcome to Part 4 in Neural Magic’s five-part blog series on pruning in machine learning. In case you missed it, Part 1 gave a pruning overview, detailed the difference between structured and unstructured pruning, and described commonly used algorithms, including gradual magnitude pruning (GMP). In Part 2 we argued that GMP is one of the best pruning approaches to use due to its simplicity, ease of use, and performance on a wide variety of models, and we discussed the three general stages of GMP: stabilization, pruning, and fine-tuning. In Part 3, we dug deeper into the GMP hyperparameters that must be defined: learning rate, pruning update frequency, and pruning schedule function.
In addition to the general hyperparameters described above, the sparsity to target per layer is arguably the most critical hyperparameter you can set. It controls the amount of performance speedup for a network and most strongly correlates with the likelihood of accuracy recovery after pruning. The older, more traditional approach with GMP was to select a uniform sparsity level for all but a few intuitively more sensitive layers, such as the input layer. This assumes that all layers affect the loss and performance equally at the same sparsity level, which experimentally has been found to be very far from the truth.
Each layer in a neural network differs in some way, whether it is the input shape, number of parameters, kernel size, or even the position/purpose of the layer in the overall network. These differences mean each layer has a dissimilar effect on the total loss and performance of the network. It is challenging to measure these sensitivities to pruning precisely due to the size of modern-day neural networks. However, there are approaches to estimating both that match the exact sensitivities reasonably closely. Using these approximations, it becomes much easier to determine the proper sparsity for each layer in the network, and the results far exceed the uniform approach.

Approximating the Layerwise Loss Sensitivity
The maximum sparsity that can be obtained for each layer without affecting the loss varies from layer to layer. Despite this, there are some general guidelines that we have found to work well internally for maximizing the recovery of the loss (a small sketch applying them follows the list):
- Large fully connected layers can generally be pruned to around 95%.
- Large 3x3 convolutions can generally be pruned to around 90%.
- Large 1x1 convolutions can generally be pruned to around 80%.
- Grouped convolutions or depthwise convolutions generally should not be pruned at all as they are already structurally pruned.
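As a rough illustration, the guidelines above can be captured as starting sparsity targets keyed by layer type. The layer-type labels and the fallback default below are illustrative assumptions for this sketch, not part of any Neural Magic API:

```python
# Minimal sketch: starting sparsity targets based on the guidelines above.
from typing import Optional

def suggested_sparsity(layer_type: str) -> Optional[float]:
    """Return a starting sparsity target for a layer, or None if it should not be pruned."""
    if layer_type in ("depthwise", "grouped"):
        return None  # already structurally "pruned"; leave dense
    guidelines = {
        "linear": 0.95,   # large fully connected layers
        "conv3x3": 0.90,  # large 3x3 convolutions
        "conv1x1": 0.80,  # large 1x1 convolutions
    }
    return guidelines.get(layer_type, 0.85)  # illustrative default for other layer types

print(suggested_sparsity("conv3x3"))  # 0.9
```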
These general guidelines are only guidelines, though. The proper sparsity for a layer deviates quite a bit depending on the attributes of the layer, the architecture of the network, and the complexity of the dataset. Because of these differences, it is much better to estimate each layer's sensitivity as it pertains to the loss. A few different methods have surfaced through research to improve these approximations. Several are listed in detail below for completeness; however, we have found the global magnitude approximation to work best by far.
Global Magnitude Approximation (Recommended)
The global magnitude approximation is a relatively quick and painless way to approximate the loss sensitivity of each layer. The assumption is simply that the magnitudes of the weights across the network are ordered well enough to be statistically valid when comparing across layers. For example, a layer with larger weights on average is more sensitive to the loss than one with weights closer to zero. This approach has been used in more complicated methods such as The Lottery Ticket Hypothesis; a thorough study of its effectiveness was provided in a recent paper by Dan Alistarh, Neural Magic’s research lead.
The approximation generally works because SGD, along with weight regularization, pushes unimportant weights globally towards zero. Therefore, layers that are more overparameterized, as determined by the training process, will have more weights closer to zero. A general algorithm for this can be formulated as the following:
"Take the average of the absolute values of the weights for each layer and then sort them from smallest to largest. The highest layers are the most sensitive to the loss and should be pruned less. The lowest layers are the least sensitive to the loss and should be pruned more."

APIs as well as scripts are made available in the Neural Magic ML Tooling suite for running the approximation, with support for PyTorch, TensorFlow, and ONNX. Contact us for directions on how to download Neural Magic ML Tooling and get your hands on the APIs and scripts.
One-Shot Approximation
The one-shot approximation measures the effect of pruning each layer individually on the loss, compared to the baseline, without any retraining. The concept was popularized as part of the AMC paper from Song Han’s lab. To implement it, each layer is individually pruned to increasing sparsity levels while measuring the deviation in the loss/output of the model. All of this can be done by running only a few hundred examples through the model for each step, so it is much faster than full retraining; however, it still requires many repeated forward passes through the network. The piecewise integral of the sparsity-versus-loss curve is then calculated for each layer, with larger values corresponding to higher sensitivity. After sorting these values, the highest layers should be pruned less, and the lowest layers should be pruned more.
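Below is a minimal sketch of this procedure for a PyTorch model. It assumes `data_batches` is a small list of `(inputs, targets)` batches and `loss_fn` is the model's loss; the helper names are illustrative, not a Neural Magic API:

```python
import copy
import numpy as np
import torch
import torch.nn as nn

def prune_weight_(module: nn.Module, sparsity: float) -> None:
    """Zero out the smallest-magnitude weights of a layer in-place."""
    w = module.weight.data
    k = int(sparsity * w.numel())
    if k > 0:
        threshold = w.abs().flatten().kthvalue(k).values
        w.mul_((w.abs() > threshold).float())

@torch.no_grad()
def one_shot_sensitivity(model, data_batches, loss_fn,
                         levels=(0.4, 0.6, 0.8, 0.9, 0.95)):
    baseline = np.mean([loss_fn(model(x), y).item() for x, y in data_batches])
    scores = {}
    for name, module in model.named_modules():
        if not isinstance(module, (nn.Linear, nn.Conv2d)):
            continue
        deviations = []
        for s in levels:
            # Prune only this layer on a fresh copy, no retraining.
            pruned = copy.deepcopy(model)
            prune_weight_(dict(pruned.named_modules())[name], s)
            loss = np.mean([loss_fn(pruned(x), y).item() for x, y in data_batches])
            deviations.append(loss - baseline)
        # Piecewise (trapezoidal) integral of the sparsity-vs-loss-deviation curve.
        scores[name] = float(np.trapz(deviations, levels))
    # Largest integral = most sensitive to the loss = prune less.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```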

APIs as well as scripts are made available in the Neural Magic ML Tooling suite for running the approximation, with support for PyTorch, TensorFlow, and ONNX. Contact us for directions on how to download Neural Magic ML Tooling and get your hands on the APIs and scripts.
Approximating the Layerwise Performance Sensitivity
Layers within the network are not created equal in terms of how much performance benefit can be achieved by pruning them. This is true not only for the speedup of the individual layer, but also for its contribution to the total execution time of the model. For example, in a standard ResNet-50 SSD model, the first four convolutional heads in a network of over 60 layers can take up to 40% of the total execution time, so it is important to focus pruning on those heads to improve inference speed. Additionally, earlier layers in most computer vision models are more memory intensive and see less speedup from sparsity.
For the Neural Magic Inference Engine, a layer must be at least 40% sparse to potentially see any speedup. Speedups for a layer then usually start to become interesting for performance gains around the 70% mark. Additionally, provided that the layer is wholly compute bound, the relationship between sparsity and performance is exponential: stepping from 80% to 85% gives the same relative gain in performance as stepping from 85% to 87.5%. Finally, grouped and depthwise convolutions are almost wholly memory bound and therefore will not see any speedup from pruning (they are already structurally pruned).
Overall, many factors influence how each layer responds to sparsity in terms of performance, including the layer's size and attributes, the batch size used with the model, and hardware considerations such as the CPU type and the number of cores the model runs on. This complexity makes it nearly impossible to approximate the effect with an equation. However, artificially pruning the model, running it for inference, and measuring the speed of each layer works very well. After running the model at multiple uniform sparsity levels, the speedup for each layer can be interpolated at any sparsity level, and the speedup in terms of contribution to the entire execution can be estimated. Generally, the best approach is to execute the baseline model and then compare layerwise times at 90% sparsity for each layer (sparse time minus baseline time). After sorting these values, the highest layers should be pruned less and the lowest/most negative layers should be pruned more.
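As a minimal sketch of this ranking, suppose you have already benchmarked per-layer execution times for the baseline model and for a uniformly 90%-sparse copy through whichever inference engine you deploy with. The layer names and timings below are placeholders, not real measurements:

```python
# Dense per-layer times (ms) and times at 90% uniform sparsity, placeholders only.
baseline_ms = {"conv1": 4.1, "conv2": 2.3, "fc": 1.0}
sparse90_ms = {"conv1": 3.9, "conv2": 0.9, "fc": 0.4}

# delta = sparse time - baseline time; more negative = bigger speedup = prune more.
deltas = {name: sparse90_ms[name] - baseline_ms[name] for name in baseline_ms}
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name}: {delta:+.2f} ms")  # most negative (most benefit) printed first
```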

APIs as well as scripts are made available in the Neural Magic ML Tooling suite for running the measured approximation through the Neural Magic Inference Engine for ONNX representations of models. Contact us for directions on how to download Neural Magic ML Tooling and get your hands on the APIs and scripts.
Selecting the Sparsity per Layer
As mentioned previously, selecting the same sparsity for every layer in the model is convenient, but it fails to take into account the intricacies of neural network architectures and generally performs poorly. Adjusting the sparsity levels using the layerwise loss and/or performance sensitivity analyses will significantly improve the end accuracy recovery along with the overall performance of the model.
Redistributing the sparsity for each layer in the network to optimize for loss and/or performance as much as possible is algorithmically possible, but can be tricky to implement generically. The Neural Magic team is actively researching state-of-the-art techniques using machine learning algorithms to add this capability to the ML Tooling product offering. The goal is a fully automated solution that can derive an achievable total sparsity for a model, redistribute the sparsity across the layers for better recovery/performance, and provide easy configurations to control the total sparsity and the degree to which performance is favored over loss. This automated solution reduces the number of choices from a combinatorial explosion of per-layer sparsities to only two: the overall sparsity and how much to optimize for loss versus performance. Currently, these additions are slated for release in September.
In place of the algorithmic solution, bucketing the layers into pruning groups works nearly as well. Generally, the following pruning buckets should be used: none, low, medium, and high.
- none: reserved for the layers that do not give any benefit or may hurt the model if pruned
- low: reserved for the layers that give limited benefit from pruning
- medium: reserved for the layers that give moderate benefit from pruning
- high: reserved for the layers that give maximal benefit from pruning
How each layer is sorted into one of the buckets depends on the data scientist's end goal: optimize for the best loss, optimize for the best performance, or optimize for a balance of the two (the recommended approach). Once the layers are sorted, three sparsity levels must be chosen as hyperparameters for the low, medium, and high buckets. A good starting point is 70%, 80%, and 90% respectively, honing in from there.
Bucketing for Best Loss
To bucket the pruning levels for achieving the best loss, a loss sensitivity analysis must first be run as detailed in the Approximating the Layerwise Loss Sensitivity section. From there, the algorithm is simple: the top 5% of layers (this number can change based on the architecture of the model) that are most sensitive to the loss, as determined by the analysis, are put in the none bucket. The remaining layers are sorted from most to least sensitive and broken into thirds: the most sensitive third goes into the low bucket, the middle third into medium, and the least sensitive third into high.
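A minimal sketch of this bucketing is below, assuming `sensitivities` maps each layer name to a loss sensitivity score where larger means more sensitive (for example, the output of the global magnitude analysis). The function and the bucket-to-sparsity mapping are illustrative, not a Neural Magic API:

```python
def bucket_for_loss(sensitivities: dict, none_fraction: float = 0.05) -> dict:
    """Assign each layer to a pruning bucket based on loss sensitivity scores."""
    ordered = sorted(sensitivities, key=sensitivities.get, reverse=True)  # most sensitive first
    n_none = max(1, round(none_fraction * len(ordered)))
    buckets = {name: "none" for name in ordered[:n_none]}  # most sensitive 5% -> none
    remaining = ordered[n_none:]
    third = len(remaining) / 3
    for i, name in enumerate(remaining):
        # Most sensitive third -> low sparsity, least sensitive third -> high.
        buckets[name] = ("low", "medium", "high")[min(int(i // third), 2)]
    return buckets

# Starting sparsity targets for each bucket (the 70/80/90% suggested above).
bucket_sparsity = {"none": None, "low": 0.7, "medium": 0.8, "high": 0.9}
```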
APIs as well as scripts are made available in the Neural Magic ML Tooling suite for creating buckets for the best loss for ONNX representations of models. Contact us for directions on how to download Neural Magic ML Tooling and get your hands on the APIs and scripts.
Bucketing for Best Performance
To bucket the pruning levels for achieving the best performance, a performance sensitivity analysis must first be run as detailed in the Approximating the Layerwise Performance Sensitivity section. From there, the algorithm is simple: the top 5% of layers (this number can change based on the architecture of the model) that benefit least from pruning, as determined by the analysis, are put in the none bucket. The remaining layers are sorted from most to least benefit and broken into thirds: the most beneficial third goes into the high bucket, the middle third into medium, and the least beneficial third into low.
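The same idea in a short, self-contained sketch, using layerwise (sparse minus baseline) time deltas as the benefit signal; the layer names and delta values are placeholders:

```python
deltas = {"conv1": -0.2, "conv2": -1.4, "fc": -0.6, "dw_conv": 0.0}  # ms, placeholders

ordered = sorted(deltas, key=deltas.get)                 # most negative (most benefit) first
n_none = max(1, round(0.05 * len(ordered)))
buckets = {name: "none" for name in ordered[-n_none:]}   # least benefit -> none
remaining = ordered[:-n_none]
third = len(remaining) / 3
for i, name in enumerate(remaining):
    # Most beneficial third -> high sparsity, least beneficial third -> low.
    buckets[name] = ("high", "medium", "low")[min(int(i // third), 2)]
print(buckets)  # {'dw_conv': 'none', 'conv2': 'high', 'fc': 'medium', 'conv1': 'low'}
```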
APIs as well as scripts are made available in the Neural Magic ML Tooling suite for creating buckets for the best performance for ONNX representations of models. Contact us for directions on how to download Neural Magic ML Tooling and get your hands on the APIs and scripts.
Bucketing for Balancing Loss and Performance (Recommended)
To bucket the pruning levels for achieving a balance between the best loss and performance, both a loss sensitivity analysis and a performance sensitivity analysis must first be run, as detailed in the two sections above. From there, sort the layers by increasing sensitivity separately for the loss analysis and for the performance analysis. The following table can then be used to determine the bucket to which each layer should be assigned.
| | Perf Sens Bottom 5% | Perf Sens Bottom 1/3 | Perf Sens Middle 1/3 | Perf Sens Top 1/3 |
| --- | --- | --- | --- | --- |
| Loss Sens Bottom 1/3 | none | medium | high | high |
| Loss Sens Middle 1/3 | none | low | medium | high |
| Loss Sens Top 1/3 | none | low | low | medium |
| Loss Sens Top 5% | none | none | none | none |
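The table can be encoded directly as a lookup. The sketch below assumes each layer has already been labeled with its loss quantile ("bottom_third", "middle_third", "top_third", "top_5pct") and its performance quantile ("bottom_5pct", "bottom_third", "middle_third", "top_third"); the label and function names are illustrative:

```python
# (loss quantile, perf quantile) -> pruning bucket, mirroring the table above.
BALANCED_BUCKETS = {
    ("bottom_third", "bottom_5pct"): "none",
    ("bottom_third", "bottom_third"): "medium",
    ("bottom_third", "middle_third"): "high",
    ("bottom_third", "top_third"): "high",
    ("middle_third", "bottom_5pct"): "none",
    ("middle_third", "bottom_third"): "low",
    ("middle_third", "middle_third"): "medium",
    ("middle_third", "top_third"): "high",
    ("top_third", "bottom_5pct"): "none",
    ("top_third", "bottom_third"): "low",
    ("top_third", "middle_third"): "low",
    ("top_third", "top_third"): "medium",
    ("top_5pct", "bottom_5pct"): "none",
    ("top_5pct", "bottom_third"): "none",
    ("top_5pct", "middle_third"): "none",
    ("top_5pct", "top_third"): "none",
}

def balanced_bucket(loss_quantile: str, perf_quantile: str) -> str:
    return BALANCED_BUCKETS[(loss_quantile, perf_quantile)]

print(balanced_bucket("bottom_third", "top_third"))  # "high"
```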
APIs and scripts, including the model_pruning_config.py script that can be used to generate a pruning config file based on the balanced approach, are made available in the Neural Magic ML Tooling suite for creating balanced buckets for ONNX representations of models. Contact us for directions on how to download Neural Magic ML Tooling and get your hands on the APIs and scripts.
Next up: ResNet-50 Example Walkthrough
In our final post of the pruning series, we'll take what we learned and put it into practice. We'll get an ONNX model representation for the ResNet-50 model and use the Neural Magic ML Tooling suite to prune it for performance.
If you have questions or would like us to help you to speed up your neural networks with pruning, reach out to us!