Video

Deploy LLMs More Efficiently with vLLM and Neural Magic

Learn why vLLM is the leading open-source inference server and how Neural Magic works with enterprises to build and scale vLLM-based model services with greater efficiency and cost savings.

The ecosystem of open-source LLMs has exploded over the past year. A new model tops the leaderboard almost every week. Enterprises can now deploy state-of-the-art, open-source LLMs like Llama 3 securely on their infrastructure of choice, fine-tuned with their data for domain-specific use cases, at a significantly lower cost than proprietary APIs.

vLLM has emerged as the most popular inference server for deploying open-source LLMs, thanks to its leading performance, ease of use, broad model support, and heterogeneous hardware backends.
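
To give a flavor of that ease of use, here is a minimal sketch using vLLM's offline Python API. The model name is only an example (it assumes you have access to the Llama 3 Instruct weights), and the prompt and sampling settings are illustrative:

    from vllm import LLM, SamplingParams

    # Load an open-source model with vLLM's default engine settings.
    # Example model; swap in any Hugging Face model vLLM supports.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

    # Basic sampling configuration for generation.
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(
        ["Summarize why open-source LLMs matter for enterprises."],
        params,
    )
    print(outputs[0].outputs[0].text)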

Neural Magic is a leading contributor to the vLLM project and offers nm-vllm, an enterprise-ready vLLM distribution. nm-vllm includes:

  • Stable builds of vLLM with long-term support, model to silicon
  • Tools and expertise for optimizing LLMs for inference with techniques like quantization and sparsity (see the sketch after this list)
  • Reference architectures for scalable deployments with Kubernetes
  • Integration of telemetry and key monitoring systems
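
As a rough illustration of the quantization piece, the sketch below loads a pre-quantized model through vLLM's built-in quantization support. The checkpoint name is hypothetical, and AWQ is just one of the formats vLLM can run; an optimized checkpoint prepared with Neural Magic's tooling would be served the same way:

    from vllm import LLM, SamplingParams

    # Hypothetical pre-quantized checkpoint; substitute any AWQ-quantized
    # repo you have prepared or downloaded.
    llm = LLM(
        model="your-org/Meta-Llama-3-8B-Instruct-AWQ",  # assumption: AWQ weights
        quantization="awq",                             # vLLM's built-in AWQ support
    )

    out = llm.generate(
        ["Explain the benefit of quantized inference in one sentence."],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    print(out[0].outputs[0].text)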

Watch our webinar recording from July 11, 2024, to learn:

  • Why vLLM is the leading open-source inference server for LLMs
  • How Neural Magic works with enterprises to build and scale vLLM-based model services