nm-vllm

nm-vllm is an enterprise-ready inference system, built on the open-source vLLM library, for operationalizing performant open-source LLMs at scale. It includes our open-source compression tools for easy optimization of Hugging Face models.

With nm-vllm, enterprises can choose where to run open-source LLMs, from cloud to datacenter to edge, with complete control over performance, security, and the model lifecycle.

Challenges

It's Hard to Execute LLMs

Deploying LLMs is infrastructure intensive.

Deploying LLMs on GPUs is an expensive investment. Without careful planning, it is easy to overspend on LLM inference, and a high price tag does not guarantee performance.

Maintaining acceptable response times for inference requests can be complex.

Depending on your use case, QPS, and input and output token lengths, setting up your infrastructure and inference serving solution is non-trivial. Ensuring that your inference server of choice maintains performance across the most popular open-source models and a wide range of workloads, especially as volume increases, requires deep knowledge and orchestration.

Variable demand patterns may require autoscaling.

As inference requests increase for an application, many inference servers become a bottleneck while serving LLM inference to growing numbers of concurrent users. This can hinder the user experience by drastically slowing down the generation of the output tokens.

Applying optimizations to models is hard without ML experts.

ML research optimization techniques like quantization, sparsity, speculative decoding, and batched LoRA can dramatically reduce the infrastructure (and therefore cost) needed to support an LLM workload, but the ML skill sets needed to apply these techniques are hard to find.


Strategy

Why vLLM

vLLM is the leading community-developed open-source LLM inference server, started at UC Berkeley in June 2023.

Its developer base continues to expand and includes a broad set of commercial companies; Neural Magic has become a top contributor and maintainer.

Its performance and ease of use have attracted a growing base of users globally.

Product Overview

Feature Highlights

Enterprise Versions

Stable, supported releases of vLLM that track the upstream, including bug fixes, backported models, hardened CI regression testing, and more.

Speed and Efficiency

Leading inference serving performance that leverages the latest ML and HPC research, like continuous batching, paged attention, quantization, sparsity, speculative decoding, and more.

Certified Inference Hardware

Fully tested and supported hardware, including NVIDIA and AMD GPUs, with active projects to support Intel GPUs, AWS Inferentia, Google TPUs, and x86 and ARM CPUs.

Certified Models

Validated performance and accuracy of key LLMs in easy-to-use Hugging Face-compatible formats, including a registry of pre-optimized checkpoints (see the loading sketch after these highlights).

Workload Analysis

Operational telemetry and multi-deployment dashboards to enable deep insights both in pre-production and production deployments.

Model Optimization

Tools and support for quantizing and sparsifying models for increased GPU efficiency.

Setup to Scale

Kubernetes/Kserve integrations for operational scaling.

Model-to-Silicon Support

Enterprise-grade support including SLAs and dedicated Slack channel.
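To make the Certified Models and Model Optimization highlights concrete, here is a minimal sketch of loading a pre-optimized, Hugging Face-compatible checkpoint with vLLM's offline Python API. The checkpoint name below is a placeholder assumption rather than a verified registry entry; any quantized model in a supported format can be substituted.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint name for illustration only; substitute any
# Hugging Face-compatible quantized export (e.g., an INT4 GPTQ/Marlin model).
MODEL_ID = "neuralmagic/Mistral-7B-Instruct-v0.2-quantized"

# vLLM reads the quantization config embedded in the checkpoint, so no
# extra flags are needed for a pre-optimized model.
llm = LLM(model=MODEL_ID)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize what nm-vllm provides in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```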

Compare

Performance Difference

Side-by-side comparison of Neural Magic nm-vllm inference performance with a Mistral 7B model running on an NVIDIA A10 GPU.

1. Baseline: FP16
2. 3X Faster: INT4 Marlin
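A comparison like the one above can be reproduced informally with vLLM's offline Python API, as in the sketch below. The INT4 checkpoint name is an assumption for illustration, and the measured speedup will vary with batch size, sequence lengths, and GPU.

```python
import sys
import time

from vllm import LLM, SamplingParams

# Run once per configuration (vLLM holds GPU memory for the life of the
# process, so each model is best benchmarked in its own run):
#   python bench.py fp16
#   python bench.py int4
CONFIGS = {
    # 1. Baseline: FP16 weights.
    "fp16": dict(model="mistralai/Mistral-7B-Instruct-v0.2", dtype="float16"),
    # 2. INT4 weights served through the Marlin kernels. The checkpoint name
    #    is a placeholder assumption; use any Marlin-compatible INT4 export.
    "int4": dict(model="neuralmagic/Mistral-7B-Instruct-v0.2-marlin",
                 quantization="marlin"),
}

llm = LLM(**CONFIGS[sys.argv[1]])
prompts = ["Explain paged attention in two sentences."] * 32
sampling = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{sys.argv[1]}: {generated / elapsed:.1f} output tokens/sec")
```

The gain from INT4 Marlin comes largely from reducing weight memory traffic during decode, which is the memory-bound phase of generation on a GPU like the A10.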

Architecture At A Glance

How It Works


Leaders

Why Neural Magic and vLLM

1. vLLM is the leading open-source LLM inference server.
2. Neural Magic is a leading commercial contributor and committer to vLLM.
3. Neural Magic is a leader in model optimization.


Engagement

Teamwork

Our team of engineers will act as an extension of your team to make your vLLM deployments successful.

Here are a few examples of the types of projects we can work on with you.