Neural Magic Community - Neural Magic

Community

Ask. Share.
Improve Your ML Magic.

Together with our community, Neural Magic brings innovation to CPU and GPU inference. Connect and engage with fellow ML practitioners interested in model compression and deployment with best-in-class performance and efficiency.

Neural Magic Leaps Into GPU Acceleration

We’re thrilled to announce our leap into GPU acceleration with the launch of nm-vllm, aimed at supercharging inference serving for compressed LLMs on GPUs.

Community Tools

Accelerate Your Inference Now

Compress your models using our open-source model optimization libraries and deploy on hardware of your choice, CPU or GPU.

Deploy on GPUs

nm-vllm

Incorporate the latest LLM optimizations for optimal GPU performance. Deploy LLMs on the GPUs of your choice using nm-vllm, our opinionated fork of the popular vLLM project.

Stars: 264

Deploy on CPUs

DeepSparse

Accelerate LLMs, CV, and NLP models with DeepSparse, our free-to-try inference runtime. Use the CPUs you already own, both x86 and ARM.

Stars: 3156

Optimize Models for Inference

SparseML

Compress your LLMs, CV, and NLP models for fast and efficient inference using SparseML. Apply quantization and sparsity using pre-configured recipes. Reduce hardware requirements and costs.

Stars: 2144

World Summit AI Americas

Stop by our booth #B50 for a brief demo of Neural Magic. On Wednesday, join our Innovation Insight where we'll share the HPC problem no one is talking about.

DATE: Apr 24, 2024

LOCATION: Montreal, Canada

The AI Conference 2024

Let's shape the future of AI together. Our co-founder, Nir Shavit, is speaking on Tuesday. Check out his talk and meet our team while you're there.

DATE: Sep 10, 2024

LOCATION: San Francisco, CA

NeurIPS 2024

We are excited to share our latest research and learn from everyone at this year's NeurIPS! Will you be there?

DATE: Dec 09, 2024

LOCATION: Vancouver, Canada

Open Source

Mar 20, 2025

3.5X Faster Vision-Language Models with Quantization

Open Source

Mar 14, 2025

Optimizing vLLM for DeepSeek-R1

Open Source

Feb 27, 2025

Quantized DeepSeek-R1 Models: Deployment-Ready Reasoning Models

Sparse Fine-Tuning for Inference Acceleration of Large Language Models

We consider fine-tuning pre-trained LLMs on specialized tasks while inducing sparsity in their weights. We observe that standard loss-based fine-tuning may fail to recover accuracy at high sparsities. To address this, we study distillation-type losses and identify an L2-based distillation approach that enables accurate recovery at higher sparsities, across all models.

Neural Magic & IST Austria
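To make the idea of an L2-based distillation loss in the abstract above more concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the paper's exact formulation: the helper name `l2_distillation_loss`, the per-layer normalization by the teacher's squared norm, and the toy feature vectors are all hypothetical choices made for this example.

```python
from typing import Sequence


def l2_distillation_loss(
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> float:
    """Mean over layers of the normalized squared L2 distance between
    student and teacher feature vectors (hypothetical helper)."""
    if not student_feats or len(student_feats) != len(teacher_feats):
        raise ValueError("need the same, non-zero number of layers for both models")
    total = 0.0
    for s_layer, t_layer in zip(student_feats, teacher_feats):
        if len(s_layer) != len(t_layer):
            raise ValueError("feature vectors must have matching sizes")
        # Squared L2 distance between the student and teacher features.
        diff = sum((s - t) ** 2 for s, t in zip(s_layer, t_layer))
        # Assumption: normalize by the teacher's squared norm so that layers
        # with large activations do not dominate the total.
        norm = sum(t * t for t in t_layer) + eps
        total += diff / norm
    return total / len(student_feats)


# Toy features for a two-layer "model" (made-up numbers).
teacher = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]]
identical_loss = l2_distillation_loss(teacher, teacher)
shifted_loss = l2_distillation_loss([[1.0, 2.0, 5.0], [0.0, 3.0, 4.0]], teacher)
```

In a real fine-tuning loop the sparse student would be trained to minimize a loss of this kind against a dense teacher, usually with tensors from a deep learning framework rather than Python lists.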

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

We show that large-scale generative pre-trained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity.

Neural Magic & IST Austria
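As a rough illustration of what "pruned to 50% sparsity in one shot" means, the sketch below zeroes half of a weight vector in a single pass with no retraining. Note that it uses simple magnitude pruning as a stand-in, which is not the SparseGPT algorithm; the function names and the toy weights are hypothetical.

```python
from typing import List


def prune_one_shot(weights: List[float], sparsity: float) -> List[float]:
    """Zero the smallest-magnitude weights in one pass (magnitude baseline,
    not SparseGPT)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def measured_sparsity(weights: List[float]) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Made-up weights for a single row of a layer.
row = [0.9, -0.05, 0.3, -1.2, 0.01, 0.6, -0.4, 0.02]
half = prune_one_shot(row, 0.5)
```

Here `measured_sparsity(half)` is 0.5 and the four largest-magnitude weights are left unchanged.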

Sparse*BERT: Sparse Models Generalize To New Tasks and Domains

Models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks. Models that are pruned during pre-training using general domain masked language models can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches.

Neural Magic
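Gradual magnitude pruning, mentioned in the Sparse*BERT abstract above, raises the target sparsity step by step during training instead of pruning all at once. The sketch below shows one commonly used way to do that, a cubic sparsity schedule. The schedule shape, the function name `cubic_sparsity`, and the step and sparsity values are assumptions for illustration; the abstract does not specify them.

```python
def cubic_sparsity(
    step: int,
    start_step: int,
    end_step: int,
    initial_sparsity: float,
    final_sparsity: float,
) -> float:
    """Target sparsity at a training step under a cubic ramp (assumed schedule)."""
    if end_step <= start_step:
        raise ValueError("end_step must be greater than start_step")
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    # Fraction of the pruning window that has elapsed, in (0, 1).
    progress = (step - start_step) / (end_step - start_step)
    # Sparsity rises quickly at first and levels off near the end.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


# Target sparsity at a few points of a hypothetical 100-step pruning window.
schedule = [cubic_sparsity(s, 0, 100, 0.0, 0.9) for s in (0, 25, 50, 75, 100)]
```

At each pruning step, the smallest-magnitude weights would be masked until the layer reaches the sparsity given by the schedule, and training then continues with the mask applied.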