Mar 20, 2024
[Major Product News] Neural Magic Announces GPU Support for LLM Inference!
Over the past several months, our team has been focused on expanding our capabilities to enable LLM inference on GPUs! A few weeks ago, we announced nm-vllm, our fork of vLLM, which focuses on incorporating the latest LLM optimizations, such as quantization and sparsity, for enhanced performance.
nm-vllm v0.1.0 Highlights
- Accelerated model inference with Marlin (4-bit Quantization)
- Developed sparse inference kernels which provide both memory reduction and acceleration of sparse models
We’ll continue to build out more features and capabilities in our nm-vllm repo to deliver performant LLM inference on GPUs. You can see the full details of our first update release here!
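To give a feel for where the memory reduction in 4-bit quantization comes from, here is a minimal pure-Python sketch of symmetric 4-bit weight quantization with one scale per group. It is an illustration only and is not the Marlin kernel or nm-vllm code; the function names and the `group_size` parameter are assumptions made for this example.

```python
from typing import List, Tuple


def quantize_4bit(weights: List[float], group_size: int = 4) -> Tuple[bytes, List[float]]:
    """Symmetric 4-bit quantization with one float scale per group (illustrative)."""
    if len(weights) % 2 != 0:
        raise ValueError("need an even number of weights to pack two per byte")
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group to the integer 7.
        # An all-zero group falls back to a scale of 1.0.
        scale = max(abs(w) for w in group) / 7.0 or 1.0
        scales.append(scale)
        for w in group:
            q = max(-8, min(7, round(w / scale)))
            codes.append(q + 8)  # shift to the unsigned range 0..15
    # Pack two 4-bit codes into each byte.
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scales


def dequantize_4bit(packed: bytes, scales: List[float], group_size: int = 4) -> List[float]:
    """Unpack the 4-bit codes and rescale them back to floats."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return [(c - 8) * scales[i // group_size] for i, c in enumerate(codes)]
```

Each weight occupies half a byte plus a share of its group's scale, compared with two bytes for a 16-bit float. Real GPU kernels add much more on top of this, such as fused dequantization inside the matrix multiply.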
1.7 Release Summary: DeepSparse, SparseML, and SparseZoo
Coming into 2024, we have had a deep focus on enabling LLM inference across both CPUs and GPUs. Our 1.7 release contains many improvements in performance, usability, and functionality across all of our libraries to enable ML engineers to optimize, fine-tune, and deploy compressed LLMs.
Full technical release notes for 1.7 are available within our GitHub release indexes linked from our Neural Magic repository. If you have any questions, need assistance, or simply want to introduce yourself, join us in the Neural Magic Community Slack. Our 1.6 release integrated basic telemetry to measure usage for product improvements for all of our open-source libraries. If you would like to disable this telemetry across DeepSparse Community, SparseML, and SparseZoo, follow the instructions under Product Usage Analytics here.
DeepSparse 1.7 Highlights
- New pipelines and OpenAI server compatibility to enable more complex pipelines for Text Generation.
- New evaluation APIs and CLIs have been added with plugins for perplexity and lm-eval-harness to enable easy evaluation of LLMs.
- Continuous batching support has been added for Text Generation serving, enabling improved inference performance over multiple text streams at once.
We have introduced new Hugging Face Transformers-compatible pipelines that give finer-grained control over text generation inference setups. We have also updated the OpenAI server endpoints to stream text generation output through the refactored pipelines. To serve multiple text streams in throughput-oriented scenarios, we have landed continuous batching support for Text Generation pathways as well. The logging and timing infrastructure for the new pipelines was expanded to enable more thorough tracking and to further support integrations with Prometheus and other standard logging platforms.
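For readers new to continuous batching, the toy scheduler below shows why it tends to help when requests generate different numbers of tokens. It is a simplified simulation that counts decode steps only, not DeepSparse's scheduler; the function names are hypothetical, and every request is assumed to need at least one token.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A finished request frees its slot for the next queued request right away."""
    queue = deque(lengths)
    active: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Refill free slots before every decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step emits one token per active request;
        # requests that just emitted their last token drop out.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps
```

With request lengths `[8, 2, 2, 2]` and a batch size of 2, the static schedule needs 10 decode steps while the continuous one needs 8, because the short requests slot in next to the long one instead of waiting for it to finish.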
To enable LLM evaluation more natively in DeepSparse, we launched new perplexity and lm-eval-harness APIs and CLIs via deepsparse.evaluation. To make evaluating LLMs even easier, we have provided an LLMPerf usage example showing how you can run DeepSparse LLM Server benchmarks using the OpenAI API endpoint interface to understand how your model will perform under different load settings. We have also expanded our deepsparse.analyze functionality to work with LLMs for easy model profiling.
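As a reminder of what a perplexity evaluation measures, the snippet below computes perplexity from per-token log-probabilities using the standard definition, the exponential of the mean negative log-likelihood. It is a standalone sketch and does not call the deepsparse.evaluation API; the `perplexity` helper is a name chosen for this example.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, which reads as being as uncertain as a uniform choice among four tokens. Lower is better.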
Lastly, we have resolved slow compile times for dense LLMs in DeepSparse and fixed the improper handling of the kv_cache input while using external KV cache management, which led to inaccurate inference results. Now, benchmarking runs for LLMs with internal KV cache no longer crash or report inaccurate numbers.
Click here to view full DeepSparse release notes.
SparseML 1.7 Highlights
- General compression techniques now support LLMs from Hugging Face Transformers.
- Eval pathways have been added for LLMs, with plugins for perplexity and lm-eval-harness.
- AutoModel for causal language models, including quantized and sparse quantized support, has been added.
We have updated our general compression techniques, including fine-tuning and one-shot, to support LLMs built on top of Hugging Face Transformers, with full FSDP support and model stages for transitioning between training and post-training pathways. Additionally, we have updated our recipe pathways to fully support LLMs for SparseML compression techniques. This enables a more seamless application of Neural Magic’s compression techniques to off-the-shelf Hugging Face LLMs.
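To illustrate what "one-shot" compression means at its simplest, the sketch below zeroes out the smallest-magnitude weights in a single pass with no retraining. This uses plain magnitude pruning as a stand-in; it is not SparseML's one-shot pathway, which applies more sophisticated methods such as OBCQ, and `magnitude_prune` is a hypothetical helper.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude fraction of weights in one pass."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by weight magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned
```

Calling `magnitude_prune([0.9, -0.1, 0.05, -0.7], 0.5)` keeps the two largest weights and zeroes the rest. In practice, accuracy at high sparsity usually depends on smarter weight selection and on fine-tuning afterwards.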
Export pathways have been simplified across Text Generation and CV use cases to auto-infer arguments such as sequence_length for transformers, leading to more error-free exports.
Lastly, we have resolved SmoothQuant device forwarding in FSDP setups, and NaN values no longer cause crashes. We have also resolved the TypeError with OBCQ when no sequence_length is provided.
Click here to view full SparseML release notes.
SparseZoo 1.7 Highlights
- New Text Generation models for various tasks:
  - Sparsified and baseline: Llama2 7B (view) and Mistral 7B (view)
- New code generation models:
  - Sparsified and baseline: CodeLlama 7B (view)
- Direct download support for LLMs
We have landed new sparsified and baseline models for Text Generation tasks including chat, instruction tuning, code generation, summarization, question answering, and arithmetic reasoning with the new Llama2 7B (view) and Mistral 7B (view) sparsified and baseline models. We also landed new sparsified and baseline CodeLlama 7B (view) models for code generation.
We added direct download support for LLMs via a new chunked download feature to improve handling of large files. Deployment directories are directly downloaded as tar.gz and subsequently unzipped, enabling faster downloads as well.
We also included the ability to overwrite existing files during download, which results in auto-correction of file corruption errors for model downloads.
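The general pattern behind chunked downloads with overwrite can be sketched with the Python standard library alone. This is not SparseZoo's implementation; `download_in_chunks` and `extract_deployment` are hypothetical names, and `source` stands for any binary file-like object, for example an HTTP response.

```python
import os
import tarfile
from typing import BinaryIO


def download_in_chunks(source: BinaryIO, dest_path: str,
                       chunk_size: int = 1024 * 1024,
                       overwrite: bool = True) -> int:
    """Copy a stream to disk chunk by chunk; returns the bytes written."""
    if os.path.exists(dest_path) and not overwrite:
        return os.path.getsize(dest_path)
    # Write to a temporary name first so a failed transfer never leaves
    # a half-written file at dest_path.
    tmp_path = dest_path + ".part"
    written = 0
    with open(tmp_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    # Replace any existing (possibly corrupted) file in one step.
    os.replace(tmp_path, dest_path)
    return written


def extract_deployment(archive_path: str, target_dir: str) -> None:
    """Unpack a tar.gz deployment archive into target_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
```

Reading in fixed-size chunks keeps memory use flat regardless of model size, and overwriting on every download means a corrupted local copy is replaced rather than reused.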
If you have models you’d like to see sparsified, quantized, and added to SparseZoo, let us know by submitting a SparseZoo Model Request form.
Click here to view full SparseZoo release notes.
Final Thoughts
Release 1.7 of DeepSparse, SparseML, and SparseZoo brings substantial advancements in AI model training, optimization, and deployment for LLMs on CPUs. If you’re interested in deploying compressed LLMs on GPUs, check out our new nm-vllm repository on GitHub, which is our new inference server for running LLMs on GPUs!
Neural Magic in the Wild
Beyond release notes, catch up on our product updates through other channels:
- World Summit AI Americas is April 24-25 in Montreal, Canada. Neural Magic has a few extra summit passes we'd love to share with the community. If you are interested, join our Slack Community and respond in the #events channel.
- ICLR (May 7-11) and ICML (July 21-27) are both happening in Vienna, Austria. We plan to be there, and updates will be shared through our website and social channels.
- The AI Conference 2024 will be held on September 10-11 in San Francisco, California. Our co-founder, Nir Shavit, will be speaking at the event and you will find us in the expo as well.
Cheers,
Rob Greenberg
Sr. Product Manager
Neural Magic