LLMs on CPUs.
Period.
From research to code, use model sparsity to accelerate open-source LLMs and bring operational simplicity to GenAI deployments.
Accelerated Inference With Sparsity
>99% accuracy recovery of the FP32 MPT model on the GSM8K dataset
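To make that concrete, here is a minimal sketch of CPU text generation with DeepSparse, assuming its TextGeneration pipeline API; the SparseZoo model stub and the argument names follow DeepSparse documentation conventions and may vary by version, so treat them as illustrative placeholders rather than guaranteed names.

```python
# Minimal sketch: CPU inference on a sparse-quantized MPT model with
# DeepSparse. The SparseZoo stub below is an illustrative placeholder,
# and argument names may differ across DeepSparse versions.
from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized")

result = pipeline(prompt="If there are 3 cars and each car has 4 wheels, how many wheels are there?")
print(result.generations[0].text)
```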
State-of-the-Art Model Optimization Research
In collaboration with the Institute of Science and Technology Austria, Neural Magic develops innovative LLM compression research and shares impactful findings with the open-source community, including the state-of-the-art Sparse Fine-Tuning technique, sketched below.
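The core idea behind sparse fine-tuning can be illustrated with a generic PyTorch sketch; this is a conceptual outline of fixed-mask training, not Neural Magic's exact recipe: prune weights by magnitude once, then fine-tune while reapplying the mask after each step so pruned weights stay zero.

```python
# Conceptual sketch of sparse fine-tuning (not Neural Magic's exact method):
# prune by magnitude once, then keep zeroed weights at zero during training.
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out the smallest-magnitude fraction of weights, keep the rest.
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

model = torch.nn.Linear(1024, 1024)  # stand-in for one LLM layer
mask = magnitude_mask(model.weight.data, sparsity=0.6)
model.weight.data *= mask  # apply the 60% sparsity pattern once

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(100):
    x = torch.randn(8, 1024)
    loss = model(x).pow(2).mean()  # placeholder for a real fine-tuning loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    model.weight.data *= mask  # restore the sparsity pattern after the update
```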
Recent LLM Papers

Benefits
Performance: Deploy state-of-the-art models trained on your data with GPU-class performance on commodity CPUs.
Flexible Deployment: Run consistently across cloud, data center, and edge with any hardware provider, from Intel to AMD to ARM.
Infinite Scalability: Bring horizontal and vertical scale to your ML solutions with physical, virtual, containerized, and serverless deployment options.
Ease of Integration: Use clean APIs for integrating models into applications and monitoring them in production; see the sketch after this list.
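As an illustration of the integration point above, a DeepSparse pipeline can be embedded in an application in a few lines. This is a sketch assuming the Pipeline.create API; the SparseZoo stub is an example, not a guaranteed model name.

```python
# Integration sketch, assuming DeepSparse's Pipeline.create API;
# the SparseZoo stub is illustrative, not a guaranteed model name.
from deepsparse import Pipeline

sentiment = Pipeline.create(
    task="sentiment-analysis",
    model_path="zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quantized-none",
)

print(sentiment("DeepSparse brings GPU-class inference to CPUs."))

# For HTTP serving, the same model can typically be exposed with the CLI:
#   deepsparse.server --task sentiment-analysis --model_path <stub>
```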


Word on the Street
“Our close collaboration with Neural Magic has driven outstanding optimizations for 4th Gen AMD EPYC™ processors. Their DeepSparse Platform takes advantage of our new AVX-512 and VNNI ISA extensions, enabling outstanding levels of AI inference performance for the world of AI-powered applications and services.”
- Kumaran Siva, Corporate VP, Software & Systems Business Development, AMD

“With Neural Magic, we can now harness CPUs more cost-effectively, reducing infrastructure costs and achieving 4-6x better performance than before.”
- Nikola Bulatovic, Data Scientist, Uhura Solutions

“The DeepSparse program showed dramatically higher numbers of queries processed per second than many of the standard systems...Neural Magic's work has broad implications for AI and for the chip community.”

Our Products
Sparsify
ML model optimizer to accelerate inference at scale.
Coming Soon