Ask. Share.
Improve Your ML Magic.

Together with our community, Neural Magic brings innovation to CPU and GPU inference. Connect and engage with fellow ML practitioners interested in model compression and deployment with best-in-class performance and efficiency.

Star and Contribute

Connect and Follow Us:

Neural Magic Leaps Into GPU Acceleration

We’re thrilled to announce our leap into GPU acceleration with the launch of nm-vllm, aimed at supercharging inference serving for compressed LLMs on GPUs.


Community Tools

Accelerate Your Inference Now

Compress your models using our open-source model optimization libraries and deploy on hardware of your choice, CPU or GPU.

Deploy on GPUs


Incorporate the latest LLM optimizations for optimal GPU performance. Deploy LLMs on the GPUs of your choice using nm-vllm, our opinionated fork of the popular vLLM project.





Deploy on CPUs


Accelerate LLMs, CV, and NLP models with DeepSparse, our free-to-try inference runtime. Use the x86 and ARM CPUs you already own.





Optimize Models for Inference


Compress your LLMs, CV, and NLP models for fast and efficient inference using SparseML. Apply quantization and sparsity using pre-configured recipes. Reduce hardware requirements and costs.


Events, Webinars, and Meetups

Let’s Get Together


World Summit AI Americas

Stop by our booth #B50 for a brief demo of Neural Magic. On Wednesday, join our Innovation Insight where we'll share the HPC problem no one is talking about.


Apr 24, 2024


Montreal, Canada


The AI Conference 2024

Let's shape the future of AI together. Our co-founder, Nir Shavit, is speaking on Tuesday. Catch his talk and meet our team while you're there.


Sep 10, 2024


San Francisco, CA


NeurIPS 2024

We are excited to share our latest research and learn from everyone at this year's NeurIPS! Will you be there?


Dec 09, 2024


Vancouver, Canada



Get Technical With Our Published ML Research

Sparse Fine-Tuning for Inference Acceleration of Large Language Models

We consider fine-tuning pre-trained LLMs on specialized tasks while inducing sparsity in their weights. We observe that standard loss-based fine-tuning may fail to recover accuracy at high sparsities. To address this, we perform a study of distillation-type losses, identifying an L2-based distillation approach that enables accurate recovery at higher sparsities across all models.

Neural Magic & IST Austria

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

We show that large-scale generative pre-trained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity.

Neural Magic & IST Austria

Sparse*BERT: Sparse Models Generalize To New Tasks and Domains

Models pruned with Gradual Unstructured Magnitude Pruning can transfer across domains and tasks: models pruned during pre-training on general-domain masked language modeling transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches.

Neural Magic


Join the Neural Magic Community

Connect and share ideas with fellow ML practitioners.

Get access to our engineering teams and ask questions.

Improve the way you use Neural Magic.

Join on Slack

Get product help and engage with fellow ML practitioners.

Follow on X

Stay current with all things Neural Magic and ML performance.

Watch on YouTube

Deep dive into ML performance with Neural Magic.

Visit our GitHub

Browse and contribute to our code, and star our repos.