Enable sparsity with a few lines of code to boost neural network inference performance.
pip install sparseml[torch]
--output_dir models/sparse_quantized \
--model_name_or_path "zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni" \
--recipe "zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni?recipe_type=transfer-question_answering" \
--distill_teacher "disable" \
--dataset_name squad --do_train --do_eval
Large Models Are Inefficient
Many of the top models across the NLP and computer vision domains are not usable in a real deployment scenario. While they are extremely accurate, they are too large and computationally intensive, which makes them hard and expensive to deploy.
Small Models Introduce Sacrifices
Decreasing the model size to fit your deployment scenario forces you to make tradeoffs between performance and accuracy. Smaller models, while faster and more efficient, deliver less accuracy on real-world data.
What if you could deliver big-model accuracy with small-model perks?
State-of-the-Art Compression Techniques Applied with Ease
Keeping up with lightning-fast advancements in model compression research is hard. Putting state-of-the-art (SOTA) research into practice is even harder.
SparseML is a toolkit that includes APIs, CLIs, scripts, and libraries that apply SOTA sparsification algorithms such as pruning and quantization to any neural network, using only a few lines of code.
Sparsification, easily applied to your models and data with SparseML, makes your deployment efficient by:
- Reducing the size of the model tenfold;
- Speeding up inference significantly;
- Maintaining model accuracy.
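
To make the two techniques named above concrete, here is a minimal sketch of one-shot magnitude pruning followed by symmetric int8 quantization on a flat list of weights. It is an illustration only, not SparseML's implementation or API: the helper names `magnitude_prune` and `quantize_int8` are hypothetical, the weights are random, and the 80% sparsity target is an assumption borrowed from the `pruned80` label in the model stub.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first n_prune are zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization to int8; returns (ints, scale)."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


# Hypothetical stand-in for one layer's weights.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

pruned = magnitude_prune(weights, sparsity=0.8)
q, scale = quantize_int8(pruned)

zeros = sum(1 for w in pruned if w == 0.0)
max_error = max(abs(w - qi * scale) for w, qi in zip(pruned, q))
print(f"zeroed weights: {zeros}/{len(pruned)}")
print(f"max quantization error: {max_error:.4f} (scale {scale:.4f})")
```

The zeroed weights can be skipped or stored compactly, and each remaining weight fits in one byte instead of four, which is where the size and speed gains come from. Pruning and quantizing in one shot like this usually costs accuracy on a real network; recovering it typically requires further training while the sparsity is applied, which this sketch does not attempt.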