Pruning and Quantizing ML Models in One Shot, Without Retraining
Neural Magic's teams have adapted advanced pruning and quantization methods to work in one shot, without retraining. These methods deliver meaningful model compression: 60% of the weights can be removed entirely and the whole model quantized to INT8, all while recovering 99% of the accuracy. The approach produces a more than 4X speedup and requires only minutes of work. This video summarizes our methods using computer vision and NLP examples so you can apply them in your current work and research.
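
To make the two ideas concrete, here is a minimal sketch of what "one shot" can look like on a single weight matrix. It is not Neural Magic's method: it assumes plain magnitude pruning and symmetric per-tensor INT8 quantization as stand-ins, and the function names (`prune_by_magnitude`, `quantize_int8`, `dequantize`) are hypothetical, chosen only for illustration. It runs on random weights with NumPy.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.6):
    """Zero out the smallest-magnitude weights in a single pass (no retraining).

    Assumption: plain magnitude pruning, used here only as an illustration.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # Magnitude of the k-th smallest weight; everything at or below it is dropped.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

def quantize_int8(weights):
    """Symmetric per-tensor quantization to INT8 (assumed scheme)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to floating point for comparison."""
    return q.astype(np.float32) * scale

# Random stand-in for one layer's weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

w_pruned = prune_by_magnitude(w, sparsity=0.6)
q, scale = quantize_int8(w_pruned)

pruned_fraction = float(np.mean(w_pruned == 0))
max_error = float(np.max(np.abs(dequantize(q, scale) - w_pruned)))
print(f"fraction of weights removed: {pruned_fraction:.3f}")
print(f"max quantization error: {max_error:.5f} (scale = {scale:.5f})")
```

A sketch like this only shows the mechanics of removing weights and mapping the rest to 8-bit integers. Recovering accuracy at this level of compression depends on the more advanced methods the video covers.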