Bringing the Neural Magic to GPUs

Mar 05, 2024



Announcing Community Support for GPU Inference Serving

Over the past five years, Neural Magic has focused on accelerating inference of deep learning models on CPUs. To achieve this, we did two things:

  1. We developed leading research for compressing models with techniques like sparsity and quantization and made it consumable via our open-source model optimization toolkit, SparseML.
  2. We built an inference runtime, DeepSparse, that accelerates sparse-quantized models on x86 and ARM CPU architectures.

Many of the techniques we used to make CPU inference more efficient can also help GPUs process LLMs. It is still early, but today we are excited to announce nm-vllm, our initial community release to support GPU inference serving. nm-vllm is based on the vLLM project, an open-source LLM inference server led by researchers at UC Berkeley. This inaugural release of nm-vllm includes support for accelerating sparse FP16 LLMs, as well as a state-of-the-art 4-bit quantized inference kernel (called Marlin) that accelerates inference in server-mode scenarios.

Why vLLM?

We made the decision to contribute to and build on top of vLLM for the following three reasons.

First, vLLM is an inference server, which means it is optimized for enterprise deployments with multiple users and applications querying the same model. vLLM already has support for key features, including:

  • Continuous batching of inference requests (increases throughput with limited hit on latency) 
  • Optimized memory management via paged attention (increases the maximum concurrent users) 
  • Built-in tensor parallelism (for multi-GPU deployments)
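To make the paged-attention bullet concrete, here is a toy sketch of the bookkeeping idea behind it: the KV cache is split into fixed-size blocks, and each sequence holds a table of the blocks it has been given, so memory is claimed only as tokens arrive. This is an illustration under our own assumptions, not vLLM's implementation; the class name `PagedKVCache` and all of its methods are hypothetical.

```python
class PagedKVCache:
    """Toy block allocator (hypothetical names, not vLLM's API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")  # 3 tokens -> 2 blocks
cache.append_token("b")      # 1 token  -> 1 block
print(cache.block_tables, "free:", cache.free_blocks)
```

Because a sequence never reserves more than one partially filled block, short requests do not strand memory that other users could be using, which is the intuition behind the higher concurrency.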

Second, vLLM has a vibrant community of contributors. As a result, it can quickly support the newest models and regularly receives contributions for the latest and greatest inference optimizations developed in the research community.

Third, vLLM is PyTorch-based and provides a pathway to support a broad range of hardware accelerators. It currently supports NVIDIA and AMD GPUs, and there are active projects to upstream support for hardware from other vendors, such as Intel, AWS (Inferentia), and Google (TPUs).

Neural Magic’s Open-Source Contributions to vLLM

Neural Magic is committed to supporting the open-source project with regular contributions to the core technology. Here are some of our PRs to date (with many more to come):

Introducing nm-vllm

nm-vllm is now available for community use on PyPI. In addition to all of the existing features and supported models in the upstream vLLM project, nm-vllm adds newly developed inference kernels that accelerate sparse and quantized models.

Sparse Kernel

Neural Magic, in collaboration with IST-Austria, developed SparseGPT and Sparse Fine-Tuning, the leading algorithms for pruning LLMs, which remove at least half of a model's weights with limited impact on accuracy. With a newly developed sparse inference kernel, organizations can use nm-vllm to reduce memory use and accelerate their LLMs. (See graphs below.)

Here is why it works. For each token generated, all weights of the LLM must be read from memory. Since LLMs are usually several gigabytes in size, reading them takes significant time and makes a large portion of the pipeline memory-bound: the GPU spends much of its time waiting for weights to arrive. This makes compression critical. The zeros in sparse weights can be compressed efficiently, so the kernel reads less memory and latency improves. The smaller footprint also frees up the limited memory attached to GPUs, which improves throughput and enables support for larger models.
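A back-of-the-envelope calculation shows the size of this effect. The sketch below estimates a lower bound on time per generated token by assuming every stored weight byte is read once at full memory bandwidth. The model size (7B parameters), the bandwidth (1000 GB/s), and the storage format (nonzero FP16 values plus a one-bit-per-weight mask) are illustrative assumptions, not measurements of nm-vllm or a description of its kernel.

```python
def weight_bytes(n_params, bits_per_weight, sparsity=0.0, bitmask=False):
    """Bytes needed to store the weights.

    sparsity: fraction of weights that are zero and therefore not stored.
    bitmask:  if True, add 1 bit per original weight to mark nonzero positions
              (an assumed, simplified sparse format).
    """
    nonzero = n_params * (1.0 - sparsity)
    total_bits = nonzero * bits_per_weight
    if bitmask:
        total_bits += n_params
    return total_bits / 8


def min_ms_per_token(n_bytes, bandwidth_gb_s):
    """Lower bound on decode time per token if all weight bytes are read once."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e3


N_PARAMS = 7e9         # assumed 7B-parameter model
BANDWIDTH_GB_S = 1000  # assumed GPU memory bandwidth, for illustration only

dense = weight_bytes(N_PARAMS, 16)
sparse = weight_bytes(N_PARAMS, 16, sparsity=0.5, bitmask=True)

for name, size in [("dense FP16", dense), ("50% sparse FP16", sparse)]:
    ms = min_ms_per_token(size, BANDWIDTH_GB_S)
    print(f"{name}: {size / 1e9:.2f} GB, at least {ms:.2f} ms per token")
```

Real kernels also read activations and the KV cache and rarely reach peak bandwidth, so treat these figures as a floor on latency rather than a prediction.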

Quantization Kernel

Also developed in collaboration with IST-Austria, GPTQ is the leading quantization algorithm for LLMs, which compresses model weights from 16 bits to 4 bits with limited impact on accuracy. nm-vllm includes support for the recently developed Marlin kernel to accelerate GPTQ models. Before Marlin, the existing kernel for INT4 inference failed to scale in scenarios with multiple concurrent users. As shown in the chart below, the Marlin kernel accelerates inference over the current GPTQ kernel by 7x and over FP16 by 3x under heavy load, dramatically improving performance for inference serving.
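To show what storing weights in 4 bits means in practice, here is a minimal sketch that quantizes one group of weights to 16 levels and packs two 4-bit codes into each byte. It uses plain round-to-nearest, which is a much simpler technique than GPTQ (GPTQ additionally compensates for quantization error), and the packing order is our own choice, not Marlin's memory layout. All function names are hypothetical.

```python
def quantize_group(weights):
    """Map floats to integer codes 0..15 with a per-group scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte (low nibble first)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed, n):
    """Inverse of pack_nibbles; n is the original number of codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:n]


def dequantize_group(codes, scale, lo):
    """Recover approximate float weights from the codes."""
    return [lo + c * scale for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 0.75, -0.25]
codes, scale, lo = quantize_group(weights)
packed = pack_nibbles(codes)
restored = dequantize_group(unpack_nibbles(packed, len(weights)), scale, lo)

print(f"{len(weights)} weights stored in {len(packed)} bytes")
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Eight weights fit in four bytes instead of the sixteen that FP16 needs, plus a small per-group overhead for the scale and offset; each restored weight is off by at most half a quantization step.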

LLM Optimizations With SparseML

Users can apply LLM optimizations such as quantization and sparsity to compress their LLMs using Neural Magic SparseML. SparseML is an open-source model optimization library that makes leading research algorithms available to practitioners. With SparseML, users can reduce the hardware required to support their workloads, which reduces overall cost. Neural Magic has a variety of compressed, ready-to-use models available on our Hugging Face organization. For support with compressing your models, join our community Slack channel.

What’s Next?

We are excited to land the initial set of sparse and weight-quantization kernels in nm-vllm, but this is just the beginning of our vision. Leveraging our past work on CPU acceleration, we have already begun further optimizations that will allow you to:

  • Combine sparsity and quantization for additional compression and speed.
  • Use activation quantization, in addition to weight compression, for further inference acceleration and memory reduction.
  • Apply decoding optimizations like speculative decoding and draft-model-free variants.

To make it easy to get started, we created three Jupyter Notebooks where you can:

  1. Deploy models from the Hugging Face Hub with nm-vllm.
  2. Apply SparseGPT to a Hugging Face model and deploy with nm-vllm.
  3. Apply GPTQ to a Hugging Face model, convert to Marlin, and deploy with nm-vllm.

If you have a project you would like to explore or wish to follow new developments, come join us in Slack or directly on GitHub.

We want to emphasize that today is only day one, but a great day for more efficient inferencing for all!
