nm-vllm

An enterprise inference server for open-source large language models (LLMs). Deploy on your GPU infrastructure with control over performance, privacy, and security.

[Demo: (1) Baseline - FP16 vs. (2) 3X Faster - INT4 Marlin]
Side-by-side comparison of Neural Magic nm-vllm inference performance with a Mistral 7B model running on an NVIDIA A10 GPU.

Challenges

It's Hard to Run LLMs on Your Terms

Deploying LLMs is expensive due to compute costs.

Deploying LLMs on GPUs is a significant investment. Without careful planning, it is easy to overspend on LLM inference, and a high price tag does not guarantee performance.

LLM inference performance is difficult to maintain as requests scale.

As requests to an application increase, many inference servers become a bottleneck when serving growing numbers of concurrent users. This degrades the user experience by drastically slowing the generation of output tokens.

Performant LLM inference is not guaranteed.

Depending on your use case, queries per second (QPS), and input and output token lengths, setting up your infrastructure and inference serving solution is non-trivial. Ensuring that your inference server of choice maintains performance across the most popular open-source models and a wide range of workloads requires deep knowledge and orchestration.

How It Works

Accelerate LLMs to Production

nm-vllm is an enterprise inference server for operationalizing performant open-source large language models (LLMs) at scale. It enables you to optimize open-source LLMs from native Hugging Face and PyTorch frameworks with Neural Magic's SparseML, then deploy them directly to production on your infrastructure of choice.

With nm-vllm, enterprises can choose where to run open-source LLMs, from cloud to datacenter to edge, with complete control over performance, privacy, security, and model lifecycle.
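As a rough illustration of what this looks like in practice, here is a minimal sketch of offline batch inference, assuming nm-vllm is installed via `pip install nm-vllm` and keeps the upstream vLLM Python API (the `LLM` class and `SamplingParams`); the model ID and prompts below are illustrative.

```python
# Minimal sketch: batch inference with the vLLM-style Python API that
# nm-vllm is assumed to expose (imported as `vllm`). Model ID is illustrative.
from vllm import LLM, SamplingParams

# Load an open-source model directly from the Hugging Face Hub.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Standard sampling controls: temperature, nucleus sampling, output length.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = [
    "Summarize the benefits of self-hosting an LLM in two sentences.",
    "Explain KV caching in one paragraph.",
]

# Generate completions for the whole batch and print the first candidate of each.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```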

Product Overview

Feature Highlights

Ecosystem Compatibility

Seamless integration with a broad set of Hugging Face models.

State-of-the-Art Performance

Optimized for multi-user serving with KV caching and PagedAttention.

Software-Delivered Acceleration

Inference acceleration from sparsity and quantization.

Any Model Size

Multi-GPU support via tensor parallelism (see the sketch after these highlights).

Enterprise Support

Stable, production-ready distribution with model-to-silicon optimization support.

Real-Time Insights

Production telemetry and monitoring.
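The acceleration and scaling highlights above can be combined in a single deployment. The sketch below assumes nm-vllm keeps the upstream vLLM constructor arguments `quantization` and `tensor_parallel_size`; the quantized checkpoint name is hypothetical, standing in for any INT4 Marlin-format model.

```python
# Sketch: quantized weights plus tensor parallelism, assuming nm-vllm keeps
# vLLM's `quantization` and `tensor_parallel_size` arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Mistral-7B-Instruct-v0.2-marlin",  # hypothetical INT4 Marlin checkpoint
    quantization="marlin",    # use INT4 Marlin kernels for weight-only quantization
    tensor_parallel_size=2,   # shard the model across 2 GPUs
)

result = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(result[0].outputs[0].text)
```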

Architecture At A Glance

Snapshot of How It Works