Neural Magic Compress

Neural Magic Compress is the ultimate developer subscription for enterprises aiming to build and deploy efficient GenAI models. Leveraging our proven expertise in model compression and performant inference, the Neural Magic Compress subscription provides enterprise-grade support, resources, and tools to reduce time-to-market, optimize operational costs, and scale AI workflows.

Neural Magic's Fully Recovered and SOTA Compressed Model Versions

Llama 3.1 (8B, 70B, 405B) / Llama 3.2 (1B, 3B) / Llama 3.3 (70B)
Qwen2.5 (0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B)
Phi-3 (mini, small, medium)
Mistral (v0.3, Nemo) / Mixtral (8x7B, 8x22B)
DeepSeek-Coder (Lite, Base)
... and more

Inference Benchmark Scenarios

Single Stream
Multi Stream
Throughput
Code Fixing
Code Generation
Conversational AI
Instruction Following
Structured Generation
... and more

Evaluations

Academic Benchmarks
Code Generation
Conversational AI
Summarization
Long Context
Structured Generation
... and more

Challenges

Deploying Efficient GenAI Models is Costly and Complex

High Infrastructure Costs

GenAI models demand substantial computational resources, significantly driving up deployment costs, particularly at enterprise scale.

Slow Response Times

The large size and auto-regressive nature of GenAI models result in slow responses without considerable infrastructure investment.

Accuracy Trade-Offs

Achieving high accuracy with GenAI models typically requires larger model sizes, leading to increased costs and reduced deployment efficiency.

Delayed Time-to-Market

Balancing performance, cost, and accuracy, along with incorporating model compression, complicates workflows and delays deployment readiness.

Enterprise Strategy

Why Neural Magic Compress

Neural Magic leads open-source AI innovation, recognized as the top commercial contributor to vLLM and LLM Compressor, and a pioneer in state-of-the-art model compression research and publications. With our Compress subscription, enterprises gain:

  • SOTA Compressed Models: Achieve full accuracy recovery and optimized inference performance tailored for diverse enterprise use cases.
  • Robust Compression Toolkit: Using our llm-compressor toolkit, we will guide you through the process of compressing your existing models (a brief sketch of the workflow follows this list).
  • Enterprise-Grade Support: Access dedicated expertise, feature development, and bug fixes for accelerated AI success.
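
As a rough illustration of that workflow, here is a minimal sketch based on the open-source llm-compressor examples; the model ID, calibration dataset, and INT8 (W8A8) scheme are placeholders, and API details may vary by version:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Recipe: smooth activation outliers, then apply GPTQ to all Linear
# layers for an INT8 weight-and-activation (W8A8) scheme.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# One-shot post-training quantization with a small calibration set.
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model ID
    dataset="open_platypus",                   # placeholder calibration data
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting checkpoint can be loaded directly by vLLM for inference.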

Our Commitment: To transform enterprise AI by merging groundbreaking research with actionable performance, delivering efficient, scalable, and production-ready solutions.

Subscription Entitlements

Compressed Model Repo

Access a private library of expertly compressed models, including quantized, sparse, and PEFT-ready models, optimized for accuracy recovery, diverse hardware performance, and seamless enterprise integration.

Comprehensive Benchmarks

Gain actionable insights with hardware-specific benchmarks and cost analysis, enabling confident decisions on model selection, deployment configurations, and resource optimization.

Real-World Evaluations

Review detailed evaluations across diverse metrics, ensuring models align with enterprise use cases and meet accuracy and performance requirements.
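
For context, evaluations of this kind are commonly reproduced with the open-source lm-evaluation-harness. The sketch below assumes its Python API; the model ID and task list are illustrative only:

```python
import lm_eval

# Score a (possibly compressed) checkpoint on a few academic benchmarks.
results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face Transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # illustrative
    tasks=["arc_challenge", "gsm8k"],  # illustrative task list
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```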

Detailed Research Reports

Access exclusive research reports, hyperparameter sweeps, and whitepapers detailing cutting-edge model compression and fine-tuning techniques.

Expert Guidance

Receive expert guidance on hyperparameter tuning, fine-tuning workflows, and model compression to ensure optimal performance.

Enterprise Support

Leverage hands-on support for LLM Compressor integration, bug fixes, feature requests, and deployment enablement for seamless pipeline adoption.

Case Studies

Compress Success Stories

See how Neural Magic Compress is used in the field

Gaming

Hundreds of Thousands of Daily Code Generations

Challenge:

The baseline Llama 70B model required two 8xA100 systems to meet their performance requirements (10 QPS at 50 ms TPOT)

Solution:

Deployed Neural Magic's INT8-quantized Llama 70B on half the GPU resources, meeting both performance and accuracy requirements
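
For a sense of what such a deployment looks like, here is a minimal vLLM sketch; the checkpoint name and parallelism settings are illustrative, not the customer's actual configuration:

```python
from vllm import LLM, SamplingParams

# Serve an INT8 (W8A8) quantized Llama 70B on a single 8-GPU node
# instead of two; vLLM reads the quantization config from the checkpoint.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",  # illustrative
    tensor_parallel_size=8,
)

outputs = llm.generate(
    ["def fix_bug(source: str) -> str:"],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```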

Retail

Millions of Daily JSON Extractions

Challenge:

Forced to deploy the baseline Llama 70B model because quantization yielded no realized performance benefit on their specific hardware

Solution:

Collaborated with Neural Magic engineers to ensure hardware-compatible, accurate quantization, resulting in 40% fewer GPU hours with full accuracy recovery
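
As an illustration of the JSON-extraction pattern, here is a minimal sketch against a vLLM OpenAI-compatible endpoint using its guided-decoding extension; the schema, endpoint, and model name are placeholders:

```python
from openai import OpenAI

# Point the OpenAI client at a local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Constrain decoding to a JSON schema via vLLM's guided_json extension.
schema = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["product", "price"],
}

response = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",  # illustrative
    messages=[{
        "role": "user",
        "content": "Extract the product and price: 'Wireless mouse, $24.99'.",
    }],
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)
```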

Experience Neural Magic