LLMs on CPUs.

From research to code, use model sparsity to accelerate open-source LLMs and bring operational simplicity to GenAI deployments.
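The core idea, pruning a model's weights so that a sparsity-aware runtime can skip the zeros, can be illustrated with a minimal, self-contained sketch. This is a conceptual toy, not Neural Magic's implementation; the function names `magnitude_prune` and `sparse_dot` are hypothetical:

```python
# Conceptual sketch of sparsity-based acceleration (illustrative only):
# magnitude pruning zeroes the smallest weights, and a sparsity-aware
# kernel then skips those zeros entirely at inference time.

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else 0.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

def sparse_dot(weights, activations):
    """Dot product that skips zero weights -- the source of the speedup."""
    return sum(w * a for w, a in zip(weights, activations) if w != 0.0)

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.002, 0.3, -0.08]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned.count(0.0))  # → 4 (half of the 8 weights removed)
```

At 50% sparsity, half of the multiply-accumulates in `sparse_dot` are skipped; production runtimes apply the same principle with structured sparsity and vectorized kernels rather than Python loops.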

Accelerated Inference With Sparsity

>99% of FP32 MPT model accuracy recovered on the GSM8K dataset

State-of-the-Art Model Optimization Research

In collaboration with the Institute of Science and Technology Austria (ISTA), Neural Magic conducts LLM compression research and shares its findings with the open-source community, including the state-of-the-art Sparse Fine-Tuning technique.



Deploy state-of-the-art models trained on your data with GPU-class performance on commodity CPUs.

Flexible Deployment

Run consistently across cloud, data center, and edge with any hardware provider from Intel to AMD to ARM.

Infinite Scalability

Bring horizontal and vertical scale to your ML solutions with physical, virtual, containerized, and serverless deployment options.

Ease of Integration

Use clean APIs for integrating models into applications and monitoring them in production.

Word on the Street

“Our close collaboration with Neural Magic has driven outstanding optimizations for 4th Gen AMD EPYC™ processors. Their DeepSparse Platform takes advantage of our new AVX-512 and VNNI ISA extensions, enabling outstanding levels of AI inference performance for the world of AI-powered applications and services.”

- Kumaran Siva, Corporate VP, Software & Systems Business Development, AMD

“With Neural Magic, we can now harness CPUs more cost-effectively, reducing infrastructure costs and achieving 4-6x better performance than before.”

- Nikola Bulatovic, Data Scientist, Uhura Solutions

“The DeepSparse program showed dramatically higher numbers of queries processed per second than many of the standard systems... Neural Magic's work has broad implications for AI and for the chip community.”

“We used the Neural Magic Inference Engine with our sparse models and the results were nothing short of impressive. By using our sparsity method, we were able to achieve almost twice the inference speed with 80% sparsity while still passing the bar of the tinyMLPerf challenge.”

“[Neural Magic is] literally crushing it when it comes to delivering on their mission, to make deep learning more accessible to everybody.”

- Francesco Pochetti, Data Scientist, AWS Machine Learning Hero

Our Products


DeepSparse
Sparsity-aware inference runtime for GPU-class performance on CPUs.
Get Started

SparseML
Open-source libraries for applying sparsification recipes to neural networks.
Get Started

SparseZoo
Open-source model repository for sparse and sparse-quantized models.
Get Started

Blog and News

Join the Neural Magic Community