Ask. Share.
Improve Your ML Magic.

Together with our community, Neural Magic brings innovation to GPU and CPU inference. Connect and engage with fellow ML practitioners interested in model compression and deployment with best-in-class performance and efficiency.


Accelerate LLM Inference With Marlin

Marlin is a mixed-precision matrix multiplication kernel that significantly advances matrix multiplication performance for LLMs, enabling 4x speedups with FP16xINT4 computations (FP16 activations, INT4 weights) for batch sizes up to 32. We have already upstreamed Marlin into the vLLM project, along with other features such as FP8 quantization, AQLM INT2 kernels, activation quantization, a CMake build system, automatic prefix caching, and more.
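
To make the FP16xINT4 idea concrete, here is a minimal sketch of weight-only 4-bit quantization followed by a dot product against full-precision activations. It is not Marlin's implementation: the function names, the symmetric per-group scheme, and the group size are assumptions chosen for illustration, and plain Python floats stand in for FP16.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric per-group quantization to integers in [-8, 7].

    Assumption for illustration: one scale per group of `group_size` weights.
    """
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales


def dequant_dot(activations, q, scales, group_size=4):
    """Dot product of activations with weights dequantized on the fly."""
    total = 0.0
    for i, (a, qi) in enumerate(zip(activations, q)):
        total += a * (qi * scales[i // group_size])
    return total


weights = [0.7, -0.35, 0.0, 0.1, 1.4, -1.4, 0.2, 0.7]
activations = [1.0, 2.0, 0.5, -1.0, 0.25, 0.25, 3.0, -2.0]
q, scales = quantize_int4(weights)
exact = sum(a * w for a, w in zip(activations, weights))
approx = dequant_dot(activations, q, scales)
print(q, scales, exact, approx)
```

Storing weights as 4-bit integers cuts the memory traffic for weights to roughly a quarter of FP16, which is where the headroom for a speedup of this kind comes from when inference is memory-bound.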


Community Tools

Accelerate Your Inference Now

Compress your models using our open-source model optimization libraries and deploy on hardware of your choice, CPU or GPU.

Deploy on GPUs


Incorporate the latest LLM optimizations for optimal GPU performance. Deploy LLMs on GPUs of your choice using nm-vllm, our opinionated fork of the popular vLLM project.


Deploy on CPUs


Accelerate LLMs, CV, and NLP models with DeepSparse, our free-to-try inference runtime. Use CPUs you already own, x86 and ARM.


Optimize Models for Inference


Compress your LLMs, CV, and NLP models for fast and efficient inference using SparseML. Apply quantization and sparsity using pre-configured recipes. Reduce hardware requirements and costs.


Events, Webinars, and Meetups

Let’s Get Together


.NEXT 2024

We'll be at the Nutanix customer conference in Barcelona. Meet our team and join our theater session to learn how we can help you own your GenAI strategy.


May 22, 2024


Barcelona, Spain


The AI Conference 2024

Let's shape the future of AI together. Our co-founder, Nir Shavit, is speaking on Tuesday. Check out his talk and meet our team while you are there.


Sep 10, 2024


San Francisco, CA


NeurIPS 2024

We are excited to share our latest research and learn from everyone at this year's NeurIPS! Will you be there?


Dec 09, 2024


Vancouver, Canada



Get Technical With Our Published ML Research

Sparse Fine-Tuning for Inference Acceleration of Large Language Models

We consider fine-tuning pre-trained LLMs on specialized tasks while inducing sparsity in their weights. We observe that standard loss-based fine-tuning may fail to recover accuracy at high sparsities. To address this, we perform a study of distillation-type losses, determining an L2-based distillation approach which enables accurate recovery at higher sparsities, across all models.
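
As a rough illustration of what an L2-based distillation loss can look like, the sketch below compares a sparse student's intermediate features with a dense teacher's, layer by layer. The function name, the per-layer normalization by the teacher's feature magnitude, and the `eps` term are assumptions made for this example; the summary above does not specify the exact form used in the paper.

```python
def l2_distillation_loss(student_feats, teacher_feats, eps=1e-6):
    """Mean over layers of squared error between student and teacher features.

    Assumption for illustration: each layer's error is divided by the
    teacher's mean squared feature value so layers contribute on a
    comparable scale.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must expose the same number of layers")
    per_layer = []
    for s, t in zip(student_feats, teacher_feats):
        err = sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(t)
        norm = sum(ti ** 2 for ti in t) / len(t)
        per_layer.append(err / (norm + eps))
    return sum(per_layer) / len(per_layer)


# Toy features for two layers (hypothetical values).
teacher = [[1.0, 2.0], [3.0, 4.0]]
student = [[0.9, 2.1], [2.5, 4.5]]
print(l2_distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's features exactly and grows as the pruned model drifts away from them, which gives fine-tuning a denser signal than the task loss alone.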

Neural Magic & IST Austria

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

We show that large-scale generative pre-trained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity.

Neural Magic & IST Austria

Sparse*BERT: Sparse Models Generalize To New Tasks and Domains

Models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks. Models that are pruned during pre-training using general domain masked language models can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches.
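
Gradual unstructured magnitude pruning can be sketched in a few lines: sparsity is raised step by step, and at each step the smallest-magnitude weights are zeroed regardless of their position. The cubic ramp, the function names, and the toy weights below are assumptions for illustration, and the sketch omits the training updates that would normally happen between pruning steps.

```python
def sparsity_at(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Cubic ramp from initial to final sparsity (an assumed schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights, ignoring structure."""
    n_prune = round(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.5, -0.1, 0.3, -0.9, 0.05, 0.7, -0.2, 0.4]
total_steps = 4
for step in range(total_steps + 1):
    target = sparsity_at(step, total_steps, final_sparsity=0.75)
    weights = magnitude_prune(weights, target)
    print(step, round(target, 3), weights)
```

Ramping the sparsity quickly at first and slowly near the end gives the model the most recovery time at the highest sparsity levels, where each additional pruned weight matters most.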

Neural Magic


Join the Neural Magic Community

Connect and share ideas with fellow ML practitioners.

Get access to our engineering teams and ask questions.

Improve the way you use Neural Magic.

Join on Slack

Get product help and engage with fellow ML practitioners.

Follow on X

Stay current with all things Neural Magic and ML performance.

Watch on YouTube

Deep dive into ML performance with Neural Magic.

Visit our GitHub

See and contribute to our code, and star our repos.