Neural Magic 1.6 Product Release


For the last several months, we’ve been quite busy building out features across our libraries to enable large language model (LLM) inference on CPUs. We upgraded SparseML to support LLMs and generative models through transformers training, sparsification, and export pipelines. DeepSparse, Neural Magic’s inference runtime, has also been enhanced for performant LLM inference. 

Keep reading for highlights of our 1.6 product release of DeepSparse, SparseML, and SparseZoo libraries. Full technical release notes are available within our GitHub release indexes linked from the specific Neural Magic repository. And if you have any questions, need assistance, or simply want to introduce yourself, join us in the Neural Magic Community Slack.

For the 1.6 release across all of our OSS libraries, we integrated basic telemetry to measure usage for product improvements. To disable this telemetry across DeepSparse Community, SparseML, and SparseZoo, follow the instructions under Product Usage Analytics here.

DeepSparse 1.6

Highlights include: 

  • State-of-the-art performance with sparse LLMs
  • OpenAI- and MLServer-compatible DeepSparse Server now available
  • ARM and macOS (beta) support
  • And further optimizations for general performance and reduced memory footprint

Decoder-only text generation LLMs, including Llama 2 7B, are now optimized in DeepSparse and offer state-of-the-art performance with sparsity. To get started, run the command pip install deepsparse[llm] and use the TextGeneration pipeline. For performance details, check out our Sparse Fine-Tuning paper.
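A minimal sketch of calling the pipeline from Python, assuming deepsparse[llm] is installed; the helper function, its arguments, and the SparseZoo stub in the comment are illustrative, not a fixed API surface:

```python
def generate(prompt: str, model_stub: str) -> str:
    """Run a sparse LLM through DeepSparse's TextGeneration pipeline.

    Sketch only: requires `pip install deepsparse[llm]`. The stub should be
    a SparseZoo LLM stub or a path to a local deployment directory.
    """
    from deepsparse import TextGeneration  # heavy dependency, imported lazily

    pipeline = TextGeneration(model=model_stub)
    result = pipeline(prompt=prompt, max_new_tokens=64)
    return result.generations[0].text

# Example (not executed here; downloads model weights on first use,
# and the stub below is illustrative):
# print(generate("What is sparse inference?",
#                "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"))
```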

To serve LLMs through standard request formats, we launched both OpenAI- and MLServer-compatible versions of DeepSparse Server. DeepSparse now also runs additional models with higher inference performance, including CLIP, Donut, Whisper, and T5.
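As a sketch of what an OpenAI-compatible request looks like, the helper below builds a chat completions request body using only the standard library; the model name, host, and port in the comments are placeholders, not real identifiers:

```python
import json


def chat_completion_request(model: str, prompt: str, max_tokens: int = 64) -> bytes:
    """Build an OpenAI-style /v1/chat/completions request body.

    The schema follows the OpenAI chat completions API that an
    OpenAI-compatible server accepts; the model name passed in is
    a placeholder chosen by the caller.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")


# POST this body with Content-Type: application/json to the running
# server, e.g. http://localhost:5543/v1/chat/completions (host and
# port are assumptions; use whatever the server was started with).
body = chat_completion_request("sparse-llm", "What is sparse inference?")
```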

Support for ARM processors is now generally available on DeepSparse. For quantized performance, ARMv8.2 or above is required.

We are also excited to announce beta support for macOS, so you can now run DeepSparse on Apple laptops. macOS Ventura (version 13) or above and Apple silicon are required. 

Lastly, we added Azure Marketplace, Google Cloud Marketplace, and DigitalOcean Marketplace integrations, so DeepSparse can now be deployed directly on these platforms for quick and easy inference serving on cloud CPUs. 

Note: through a 1.6.1 hotfix, the Neural Magic DeepSparse Community License file has been renamed from LICENSE-NEURALMAGIC to LICENSE for higher visibility, both in the DeepSparse GitHub repository and in the C++ engine package tarball, deepsparse_api_demo.tar.gz.

SparseML 1.6

Highlights include:

  • Support for sparsifying LLMs
  • New QuantizationModifier for PyTorch sparsification pathways
  • SparseGPT, OBC, and OBQ one-shot/post-training pruning and quantization modifiers for PyTorch pathways

We now support LLMs and generative models through transformers training, sparsification, and export pipelines. Specifically, BLOOM, CodeGen, OPT, Falcon, GPTNeo, Llama, MPT, and Whisper model support are now added to SparseML. 

A new QuantizationModifier has been implemented for PyTorch sparsification pathways, offering cleaner, simpler, and more robust arguments for quantizing models than the legacy quantization modifier. Additionally, INT4 quantization support has been added for model sparsification and export. SparseGPT, OBC, and OBQ one-shot/post-training pruning and quantization modifiers have also been added for PyTorch pathways, enabling these algorithms to be applied directly to models. 
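As a rough sketch, a one-shot recipe combining the new modifiers might look like the fragment below. The stage name, ignore list, and argument values are illustrative only; exact modifier arguments vary by model and SparseML version:

```yaml
one_shot_stage:
  obcq_modifiers:
    # New-style quantization modifier with simplified arguments
    QuantizationModifier:
      ignore: ["lm_head"]   # modules left unquantized (illustrative)
    # SparseGPT one-shot pruning, optionally quantizing as it prunes
    SparseGPTModifier:
      sparsity: 0.5         # target 50% unstructured sparsity
      block_size: 128
      quantize: true
```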

Lastly, Ultralytics YOLOv8 and CLIP training and sparsification pipelines are now natively supported by SparseML. 

Click here to view full SparseML release notes.

SparseZoo 1.6

Highlights include:

  • New generative AI models now available
  • UX upgrades for SparseZoo website
  • Updated model file structure and simplified stubs

Check out the redesigned Neural Magic SparseZoo homepage, which brings an improved interface and a performance boost for a better user experience with our SparseZoo model hub. With a new look, feel, and domain structure, SparseZoo now features Neural Magic’s best models across generative AI, computer vision (CV), and natural language processing (NLP), with enhanced search and sorting. 

SparseZoo is now home to our newest sparse generative AI models, including CodeGen Mono/Multi, Llama 2, MPT, and OPT.

In addition to a new generative AI section on SparseZoo, we have also added new YOLOv8 models and made documentation and configuration upgrades to many of our popular models across computer vision and NLP, like EfficientNet-B0 to B5 and BERT-Large.

Lastly, we updated the model file structure and shortened SparseZoo model stubs, which expands the number of supported files and reduces the number of bytes that need to be downloaded for model checkpoints, folders, and files.
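As a minimal sketch of pulling a model with one of the shortened stubs, assuming the sparsezoo package is installed; the helper function and the example stub in the comment are illustrative:

```python
def fetch_checkpoint(stub: str, download_dir: str = "./model") -> str:
    """Download a model from SparseZoo using a shortened model stub.

    Sketch only: requires `pip install sparsezoo`; the stub passed in is
    whatever appears on the model's SparseZoo page.
    """
    from sparsezoo import Model  # heavy dependency, imported lazily

    model = Model(stub, download_path=download_dir)
    model.download()  # fetches the checkpoint files into download_dir
    return model.path

# Example (not executed here; performs a network download, and the
# stub below is illustrative):
# path = fetch_checkpoint("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")
```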

If you have models you’d like to see added to SparseZoo and sparsified, let us know by submitting a SparseZoo Model Request form. 

Note: through a 1.6.1 hotfix, NOTICE now contains an updated summarized list of SparseZoo models with their appropriate license and GitHub repository attributions for easier reference, and the Neural Magic DeepSparse Community License has been renamed from LICENSE-NEURALMAGIC to LICENSE. Model cards now contain a footer link to the updated SparseZoo NOTICE file.

Click here to view full SparseZoo release notes.

Final Thoughts

Release 1.6 of DeepSparse, SparseML, and SparseZoo brings substantial advancements in AI model training, optimization, and deployment. Neural Magic continues to pave the way for more efficient and powerful AI applications, so both the community and enterprises can experience the benefits of this latest release.

Join us in this exciting journey towards an AI-powered future!

- Neural Magic Product Team