Neural Magic Community - Neural Magic

Community

Ask. Share.
Improve Your ML Magic.

Together with our community, Neural Magic brings innovation to GPU and CPU inference. Connect and engage with fellow ML practitioners interested in model compression and deployment with best-in-class performance and efficiency.


vLLM Open Office Hours

As one of the top contributors to the vLLM project, Neural Magic is excited to partner with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly open office hours.

Come with questions to learn more about the vLLM project and how Neural Magic can help you bring the power of open-source LLMs and vLLM to your enterprise.

Community Tools

Accelerate Your Inference Now

Compress your models using our open-source model optimization libraries and deploy on the hardware of your choice, CPU or GPU.

Deploy on GPUs

nm-vllm

Incorporate the latest LLM optimizations for optimal GPU performance. Deploy LLMs on the GPUs of your choice using nm-vllm, our opinionated fork of the popular vLLM project.

GitHub stars: 264

Deploy on CPUs

DeepSparse

Accelerate LLMs, CV, and NLP models with DeepSparse, our free-to-try inference runtime. Use CPUs you already own, x86 and ARM.

GitHub stars: 3158

Optimize Models for Inference

SparseML

Compress your LLMs, CV, and NLP models for fast and efficient inference using SparseML. Apply quantization and sparsity using pre-configured recipes. Reduce hardware requirements and costs.

GitHub stars: 2145

[Virtual] vLLM Open Office Hours

Join vLLM and Neural Magic experts for open office hours, where they answer your questions about optimized LLM inference and accelerated enterprise ML production deployments using vLLM and Neural Magic.

DATE: Nov 14, 2024

LOCATION: Online via Zoom

NeurIPS 2024

We are excited to share our latest research and learn from everyone at this year's NeurIPS! Will you be there?

DATE: Dec 09, 2024

LOCATION: Vancouver, Canada

Open Source

Mar 20, 2025

3.5X Faster Vision-Language Models with Quantization

Open Source

Mar 14, 2025

Optimizing vLLM for DeepSeek-R1

Open Source

Feb 27, 2025

Quantized DeepSeek-R1 Models: Deployment-Ready Reasoning Models

Sparse Fine-Tuning for Inference Acceleration of Large Language Models

We consider fine-tuning pre-trained LLMs on specialized tasks while inducing sparsity in their weights. We observe that standard loss-based fine-tuning may fail to recover accuracy at high sparsities. To address this, we study distillation-type losses and identify an L2-based distillation approach that enables accurate recovery at higher sparsities across all models.

Neural Magic & IST Austria
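
To make the idea of an L2-based distillation loss in the sparse fine-tuning summary above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the per-layer averaging, and the normalization by the teacher's squared norm are all assumptions chosen for illustration. The sketch compares the intermediate features of a sparse student with those of a dense teacher, layer by layer.

```python
from typing import Sequence


def l2_feature_loss(student: Sequence[float],
                    teacher: Sequence[float],
                    eps: float = 1e-12) -> float:
    """Squared L2 distance between student and teacher features.

    Dividing by the teacher's squared norm is an assumption made here so
    that layers with large activations do not dominate the total.
    """
    if len(student) != len(teacher):
        raise ValueError("feature vectors must have the same length")
    diff = sum((s - t) ** 2 for s, t in zip(student, teacher))
    norm = sum(t ** 2 for t in teacher)
    return diff / (norm + eps)


def distillation_loss(student_layers: Sequence[Sequence[float]],
                      teacher_layers: Sequence[Sequence[float]]) -> float:
    """Average the per-layer L2 feature loss over all layers (hypothetical helper)."""
    if not student_layers or len(student_layers) != len(teacher_layers):
        raise ValueError("need the same non-zero number of layers")
    per_layer = [l2_feature_loss(s, t)
                 for s, t in zip(student_layers, teacher_layers)]
    return sum(per_layer) / len(per_layer)


if __name__ == "__main__":
    teacher = [[1.0, 2.0], [3.0, 4.0]]
    student = [[1.0, 2.0], [0.0, 0.0]]
    # First layer matches exactly; second layer is entirely missing.
    print(distillation_loss(student, teacher))
```

In a real fine-tuning loop, a term like this would typically be added to, or used in place of, the task loss while the pruned weights are held at zero. How the terms are weighted is a design choice this sketch does not cover.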

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

We show that large-scale generative pre-trained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity.

Neural Magic & IST Austria

Sparse*BERT: Sparse Models Generalize To New Tasks and Domains

Models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks. Models that are pruned during pre-training using general domain masked language models can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches.

Neural Magic
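
As a rough illustration of gradual unstructured magnitude pruning as described in the Sparse*BERT summary above, the sketch below raises a sparsity target over training steps and zeroes the smallest-magnitude weights at each target. The cubic schedule, the function names, and the flat list of weights are assumptions made for this example. The sketch is not taken from the paper or from SparseML.

```python
from typing import List, Sequence


def sparsity_at_step(step: int, total_steps: int,
                     final_sparsity: float,
                     initial_sparsity: float = 0.0) -> float:
    """Sparsity target at a given step.

    Uses a cubic ramp from initial_sparsity to final_sparsity. The cubic
    shape is an assumption; any monotone schedule would fit the same idea.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    gap = initial_sparsity - final_sparsity
    return final_sparsity + gap * (1.0 - progress) ** 3


def magnitude_prune(weights: Sequence[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    "Unstructured" means individual weights are removed, with no constraint
    on which rows, columns, or blocks they belong to.
    """
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    weights = [0.5, -0.1, 0.3, -0.9, 0.05, 0.7, -0.2, 0.4]
    for step in (0, 5, 10):
        target = sparsity_at_step(step, total_steps=10, final_sparsity=0.75)
        # In real training, weights would be updated between pruning steps.
        weights = magnitude_prune(weights, target)
        print(step, round(target, 3), weights)
```

The gradual ramp gives the remaining weights time to adapt between pruning steps, in contrast to removing the full target fraction at once.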