Community

Ask. Share. Improve Your ML Magic.

Together with our community, Neural Magic brings innovation to GPU and CPU inference. Connect and engage with fellow ML practitioners interested in model compression and deployment with best-in-class performance and efficiency.

Star and Contribute

Connect and Follow Us:

vLLM Open Office Hours

As one of the top contributors to vLLM, Neural Magic is excited to partner with project committers and the vLLM team at UC Berkeley to host bi-weekly open office hours.

Come with questions to learn more about the vLLM project and how Neural Magic can help you bring the power of open-source LLMs and vLLM to your enterprise.

Community Tools

Accelerate Your Inference Now

Compress your models using our open-source model optimization libraries and deploy on hardware of your choice, CPU or GPU.

Deploy on GPUs

nm-vllm

Incorporate the latest LLM optimizations for optimal GPU performance. Deploy LLMs on the GPUs of your choice using nm-vllm, our opinionated fork of the popular vLLM project.
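
For a quick taste, here is a minimal offline-generation sketch, assuming nm-vllm preserves vLLM's Python API (the LLM class and SamplingParams); the quantized model ID below is a placeholder:

from vllm import LLM, SamplingParams  # assumption: nm-vllm installs under the same vllm import path

# Load a model from the Hugging Face Hub; this model ID is illustrative.
llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16")

# Generate a completion with simple sampling settings.
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What does sparse inference mean?"], params)
print(outputs[0].outputs[0].text)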

Deploy on CPUs

DeepSparse

Accelerate LLMs, CV, and NLP models with DeepSparse, our free-to-try inference runtime. Use the CPUs you already own, whether x86 or ARM.
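
As a minimal sketch of DeepSparse's Pipeline API (the SparseZoo model stub below is illustrative, not a specific recommendation):

from deepsparse import Pipeline

# Build a sentiment-analysis pipeline from a sparse, quantized model;
# the zoo stub is illustrative and can be swapped for a local ONNX path.
pipeline = Pipeline.create(
    task="sentiment-analysis",
    model_path="zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none",
)

# Run inference on CPU and print the predicted labels and scores.
print(pipeline(["DeepSparse makes CPU inference fast"]))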

Optimize Models for Inference

SparseML

Compress your LLMs, CV, and NLP models for fast and efficient inference using SparseML. Apply quantization and sparsity using pre-configured recipes. Reduce hardware requirements and costs.
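
A minimal sketch of SparseML's recipe-driven PyTorch flow; the toy model and recipe path are placeholders standing in for a real training setup:

import torch
from sparseml.pytorch.optim import ScheduledModifierManager

# Toy model and optimizer stand in for a real training setup.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Load a pruning/quantization recipe (path is a placeholder) and wrap the
# optimizer so the recipe's modifiers fire on schedule during training.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=100)

# ... standard training loop goes here ...

manager.finalize(model)  # clean up modifier hooks once training ends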

Events, Webinars, and Meetups

Let’s Get Together

[Virtual] vLLM Open Office Hours

Join vLLM and Neural Magic experts for open office hours to answer your questions on all things related to optimized LLM inference and accelerated enterprise ML production deployments using vLLM and Neural Magic.

DATE:

Nov 14, 2024

LOCATION:

Online via Zoom

NeurIPS 2024

We are excited to share our latest research and learn from everyone at this year's NeurIPS! Will you be there?

DATE:

Dec 09, 2024

LOCATION:

Vancouver, Canada

Research Papers

Get Technical With Our Published ML Research

Sparse Fine-Tuning for Inference Acceleration of Large Language Models

We consider fine-tuning pre-trained LLMs on specialized tasks while inducing sparsity in their weights. We observe that standard loss-based fine-tuning may fail to recover accuracy at high sparsities. To address this, we perform a study of distillation-type losses, identifying an L2-based distillation approach that enables accurate recovery even at high sparsities, across all models.
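
As a rough sketch (our notation, not necessarily the paper's exact formulation), such a per-layer L2 distillation loss between a dense teacher and a sparse student can be written as:

\[
\mathcal{L}_{\text{distill}} \;=\; \sum_{l=1}^{L} \frac{\lVert f_s^{(l)}(x) - f_t^{(l)}(x) \rVert_2^2}{\lVert f_t^{(l)}(x) \rVert_2^2}
\]

where \(f_s^{(l)}\) and \(f_t^{(l)}\) denote the student's and teacher's representations at layer \(l\).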

Neural Magic & IST Austria

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

We show that large-scale generative pre-trained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity.

Neural Magic & IST Austria

Sparse*BERT: Sparse Models Generalize To New Tasks and Domains

Models pruned with Gradual Unstructured Magnitude Pruning can transfer between domains and tasks: models pruned during general-domain masked-language-model pre-training can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches.

Neural Magic

Join the Neural Magic Community

Connect and share ideas with fellow ML practitioners.

Get access to our engineering teams and ask questions.

Improve the way you use Neural Magic.

Join on Slack

Get product help and engage with fellow ML practitioners.

Follow on X

Stay current with all things Neural Magic and ML performance.

Watch on YouTube

Deep dive into ML performance with Neural Magic.

Visit our GitHub

See and contribute to our code. And star our repos.