LLMs on CPUs.
Period.
From research to code, use model sparsity to accelerate open-source LLMs and bring operational simplicity to GenAI deployments.
Accelerated Inference With Sparsity
>99% accuracy recovery of the FP32 MPT model on the GSM8K dataset
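To make that concrete, here is a minimal sketch of CPU text generation with DeepSparse, assuming its TextGeneration pipeline API; the SparseZoo model stub and the argument names follow DeepSparse documentation conventions and may vary by version, so treat them as illustrative placeholders rather than guaranteed names.

```python
# Minimal sketch: CPU inference on a sparse-quantized MPT model with
# DeepSparse. The SparseZoo stub below is an illustrative placeholder,
# and argument names may differ across DeepSparse versions.
from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized")

result = pipeline(prompt="If there are 3 cars and each car has 4 wheels, how many wheels are there?")
print(result.generations[0].text)
```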
State-of-the-Art Model Optimization Research
In collaboration with the Institute of Science and Technology Austria, Neural Magic develops innovative LLM compression research and shares impactful findings with the open-source community, including the state-of-the-art Sparse Fine-Tuning technique, sketched below.
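The core idea behind sparse fine-tuning can be illustrated with a generic PyTorch sketch; this is a conceptual outline of fixed-mask training, not Neural Magic's exact recipe: prune weights by magnitude once, then fine-tune while reapplying the mask after each step so pruned weights stay zero.

```python
# Conceptual sketch of sparse fine-tuning (not Neural Magic's exact method):
# prune by magnitude once, then keep zeroed weights at zero during training.
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out the smallest-magnitude fraction of weights, keep the rest.
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

model = torch.nn.Linear(1024, 1024)  # stand-in for one LLM layer
mask = magnitude_mask(model.weight.data, sparsity=0.6)
model.weight.data *= mask  # apply the 60% sparsity pattern once

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(100):
    x = torch.randn(8, 1024)
    loss = model(x).pow(2).mean()  # placeholder for a real fine-tuning loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    model.weight.data *= mask  # restore the sparsity pattern after the update
```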
Recent LLM Papers

Benefits
Performance: Deploy state-of-the-art models trained on your data with GPU-class performance on commodity CPUs.
Flexible Deployment: Run consistently across cloud, data center, and edge with any hardware provider, from Intel to AMD to ARM.
Infinite Scalability: Bring horizontal and vertical scale to your ML solutions with physical, virtual, containerized, and serverless deployment options.
Ease of Integration: Use clean APIs for integrating models into applications and monitoring them in production; see the sketch after this list.
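As an illustration of the integration point above, a DeepSparse pipeline can be embedded in an application in a few lines. This is a sketch assuming the Pipeline.create API; the SparseZoo stub is an example, not a guaranteed model name.

```python
# Integration sketch, assuming DeepSparse's Pipeline.create API;
# the SparseZoo stub is illustrative, not a guaranteed model name.
from deepsparse import Pipeline

sentiment = Pipeline.create(
    task="sentiment-analysis",
    model_path="zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quantized-none",
)

print(sentiment("DeepSparse brings GPU-class inference to CPUs."))

# For HTTP serving, the same model can typically be exposed with the CLI:
#   deepsparse.server --task sentiment-analysis --model_path <stub>
```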


Word on the Street
“Our close collaboration with Neural Magic has driven outstanding optimizations for 4th Gen AMD EPYC™ processors. Their DeepSparse Platform takes advantage of our new AVX-512 and VNNI ISA extensions, enabling outstanding levels of AI inference performance for the world of AI-powered applications and services.”
- Kumaran Siva, Corporate VP, Software & Systems Business Development, AMD

“With Neural Magic, we can now harness CPUs more cost-effectively, reducing infrastructure costs and achieving 4-6x better performance than before.”
- Nikola Bulatovic, Data Scientist, Uhura Solutions

“The DeepSparse program showed dramatically higher numbers of queries processed per second than many of the standard systems...Neural Magic's work has broad implications for AI and for the chip community.”

Our Products
Sparsify
ML model optimizer to accelerate inference at scale.
Coming Soon