nm-vllm

An enterprise inference server for open-source large language models (LLMs). Deploy on your GPU infrastructure with control over performance, privacy, and security.

[Demo: (1) Baseline - FP16 vs. (2) 3X Faster - INT4 Marlin]
Side-by-side comparison of Neural Magic nm-vllm inference performance with a Mistral 7B model running on an NVIDIA A10 GPU.

Challenges

It's Hard to Run LLMs on Your Terms

Deploying LLMs is expensive due to compute costs.

Deploying LLMs on GPUs is a significant investment. Without careful planning, it is easy to overspend on LLM inference, and a high price tag does not guarantee performance.

LLM inference performance is difficult to maintain as requests scale.

As requests to an application increase, many inference servers become a bottleneck when serving growing numbers of concurrent users. This degrades the user experience by drastically slowing the generation of output tokens.

Performant LLM inference is not guaranteed.

Depending on your use case, queries per second (QPS), and input and output token lengths, setting up your infrastructure and inference serving solution is non-trivial. Ensuring that your inference server of choice maintains performance across the most popular open-source models and a wide range of workloads requires deep knowledge and orchestration.

How It Works

Accelerate LLMs to Production

nm-vllm is an enterprise inference server for operationalizing performant open-source large language models (LLMs) at scale. It enables you to optimize open-source LLMs from native Hugging Face and PyTorch frameworks with Neural Magic's SparseML, then deploy them directly to production on your infrastructure of choice.

With nm-vllm, enterprises can choose where to run open-source LLMs, from cloud to datacenter to edge, with complete control over performance, privacy, security, and model lifecycle.
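As a rough illustration of what this looks like in practice, here is a minimal sketch of offline batch inference, assuming nm-vllm is installed via `pip install nm-vllm` and keeps the upstream vLLM Python API (the `LLM` class and `SamplingParams`); the model ID and prompts below are illustrative.

```python
# Minimal sketch: batch inference with the vLLM-style Python API that
# nm-vllm is assumed to expose (imported as `vllm`). Model ID is illustrative.
from vllm import LLM, SamplingParams

# Load an open-source model directly from the Hugging Face Hub.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Standard sampling controls: temperature, nucleus sampling, output length.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = [
    "Summarize the benefits of self-hosting an LLM in two sentences.",
    "Explain KV caching in one paragraph.",
]

# Generate completions for the whole batch and print the first candidate of each.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```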

Product Overview

Feature Highlights

Ecosystem Compatibility

Seamless integration with a broad set of Hugging Face models.

State-of-the-Art Performance

Optimized for multi-user serving with KV caching and PagedAttention.

Software-Delivered Acceleration

Inference acceleration from sparsity and quantization.

Any Model Size

Multi-GPU support via tensor parallelism (see the sketch after these highlights).

Enterprise Support

Stable, production-ready distribution with model-to-silicon optimization support.

Real-Time Insights

Production telemetry and monitoring.
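The acceleration and scaling highlights above can be combined in a single deployment. The sketch below assumes nm-vllm keeps the upstream vLLM constructor arguments `quantization` and `tensor_parallel_size`; the quantized checkpoint name is hypothetical, standing in for any INT4 Marlin-format model.

```python
# Sketch: quantized weights plus tensor parallelism, assuming nm-vllm keeps
# vLLM's `quantization` and `tensor_parallel_size` arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Mistral-7B-Instruct-v0.2-marlin",  # hypothetical INT4 Marlin checkpoint
    quantization="marlin",    # use INT4 Marlin kernels for weight-only quantization
    tensor_parallel_size=2,   # shard the model across 2 GPUs
)

result = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(result[0].outputs[0].text)
```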

Architecture At A Glance

Snapshot of How It Works