Enable sparsity with a few lines of code to boost neural network inference performance.

pip install sparseml[torch]

sparseml.transformers.question_answering \
--output_dir models/sparse_quantized \
--model_name_or_path "zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni" \
--recipe "zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni?recipe_type=transfer-question_answering" \
--distill_teacher "disable" \
--dataset_name squad --do_train --do_eval

Large Models Are Inefficient

Many of the top models across the NLP and computer vision domains are not usable in a real deployment scenario. While they are extremely accurate, they are too large and computationally intensive, making them hard and expensive to deploy.

Small Models Introduce Sacrifices

Shrinking a model to fit your deployment scenario forces you to trade accuracy for performance. Smaller models, while faster and more efficient, deliver lower accuracy on real-world data.

What if you could deliver big model accuracy with small model perks?


State-of-the-Art Compression Techniques Applied with Ease

Keeping up with lightning-fast advancements in model compression research is hard. Putting state-of-the-art (SOTA) research into practice is even harder.

SparseML is a toolkit that includes APIs, CLIs, scripts, and libraries that apply SOTA sparsification algorithms such as pruning and quantization to any neural network, using only a few lines of code.
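To make the idea concrete, here is a toy sketch in plain Python of one-shot unstructured magnitude pruning, the core operation behind pruning recipes. This is an illustration only, not SparseML's implementation; SparseML applies sparsity gradually over the course of training.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Toy illustration of unstructured magnitude pruning. Ties at the
    threshold may prune slightly more than the requested fraction.
    """
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # The magnitude at or below which weights are removed.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.03, 0.5, -0.05]
pruned = magnitude_prune(weights, sparsity=0.5)
# Half the weights are now exact zeros, which compresses well and lets
# sparsity-aware runtimes skip the corresponding multiplications.
```

Pruned like this, the model's learned large-magnitude weights survive while the rest become zeros, which is what enables the size and speed wins below.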

Sparsification, easily applied to your models and data with SparseML, makes your deployment efficient by:

  • Reducing the size of the model tenfold
  • Speeding up inference dramatically
  • Maintaining model accuracy

Learn more about sparsification →

Reduce Model Size

Yield more flexibility in deployment with smaller models.

Accelerate Inference

Run models with lower latency and higher throughput.

SOTA Research

Use always-current, cutting-edge compression algorithms to make inference efficient.



SOTA Optimization Algorithms
Get better model performance at the same accuracy by introducing sparsity
Pre-Optimized Models
Easily apply ready-to-go recipes from SparseZoo to your own data
Standard Logging
Track model experiments with TensorBoard and Weights & Biases integrations
Easy Export
Straightforward pre-built model export pipeline, so you can deploy faster
Training Pipeline Integrations
Integrated with SOTA training pipelines like PyTorch, Ultralytics, and Hugging Face
Manageable Workflow
Get started in just a few lines of code
Custom Modifiers
Modify any training graph and easily get to a functioning implementation
Free & Open Source
Free and adaptable software, always updated with SOTA algorithms
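For a sense of what a recipe looks like: recipes are YAML files listing modifiers that drive sparsification during training. The sketch below follows the style of SparseML's gradual magnitude pruning modifiers; the exact modifier names, fields, and values here are illustrative assumptions, so check the SparseML documentation or a SparseZoo recipe before relying on them.

```yaml
# Illustrative recipe sketch: gradually prune to 80% sparsity during training.
# Modifier names and fields are assumptions based on SparseML's recipe style.
modifiers:
  - !EpochRangeModifier
    start_epoch: 0.0
    end_epoch: 30.0

  - !GMPruningModifier
    start_epoch: 2.0
    end_epoch: 20.0
    init_sparsity: 0.05
    final_sparsity: 0.8
    update_frequency: 0.5
    params: __ALL_PRUNABLE__
```

Because the recipe is declarative, the same training code can produce dense, pruned, or pruned-and-quantized models just by swapping the recipe.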

Stop making sacrifices between model performance and accuracy. Enable both with SparseML.