Explore Our Latest Insights
2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
A Sparse Summary: Introducing Sparse Llama 3.1 8B. Large language models (LLMs) are approaching their limits in terms of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing... (A brief sketch of the 2:4 sparsity pattern follows this entry.)
11.25.2024
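For readers unfamiliar with the 2:4 pattern named in the title above, the sketch below shows one common way to impose it with magnitude pruning: in every contiguous group of four weights, the two smallest-magnitude values are zeroed. The NumPy helper `prune_2_of_4` is illustrative only and is not Neural Magic's pruning code.

```python
# Illustrative magnitude-based 2:4 structured pruning (not the method from the post).
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every contiguous group of 4.

    `weights` is a 2-D matrix whose last dimension is divisible by 4, as the
    2:4 sparsity pattern accelerated on NVIDIA GPUs requires.
    """
    rows, cols = weights.shape
    assert cols % 4 == 0, "last dimension must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4)              # (rows, groups, 4)
    order = np.argsort(np.abs(groups), axis=-1)               # ascending by magnitude
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., :2], False, axis=-1)   # drop the 2 smallest
    return (groups * mask).reshape(rows, cols)

if __name__ == "__main__":
    w = np.random.randn(2, 8).astype(np.float32)
    sparse_w = prune_2_of_4(w)
    print(sparse_w)  # exactly 2 of every 4 consecutive weights are zero
```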
We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference. However, there has been a persistent question of whether these quantized models retain the same level of accuracy... (A minimal quantize-and-dequantize sketch follows this entry.)
10.17.2024
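As a quick illustration of the quantization discussed above, the following sketch performs a symmetric per-tensor int8 round trip and reports the reconstruction error. The helper names are hypothetical; this is not the evaluation harness used in the post.

```python
# Illustrative symmetric per-tensor int8 quantization round trip.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float weights to int8 with a single symmetric scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4096).astype(np.float32)
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print("max abs error:", np.abs(w - w_hat).max())  # small but nonzero
```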
Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks. The most common implementation is w4a16 quantization (e.g., GPTQ or AWQ), which uses 4-bit quantized weights and 16-bit activations... (An illustrative w4a16 sketch follows this entry.)
10.14.2024
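To make the w4a16 scheme concrete, the sketch below emulates a mixed-input GEMM in NumPy: weights are quantized to 4-bit integers with per-column scales and zero points, dequantized to fp16, and multiplied with fp16 activations. Machete fuses this dequantization into a Hopper GPU kernel; the code here only demonstrates the arithmetic, and all function names are illustrative.

```python
# Illustrative emulation of a w4a16 mixed-input matmul (dequantize, then GEMM in fp16).
import numpy as np

def quantize_w4(w: np.ndarray):
    """Per-column asymmetric 4-bit quantization (codes in 0..15)."""
    w_min, w_max = w.min(axis=0), w.max(axis=0)
    scale = (w_max - w_min) / 15.0
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16), zero.astype(np.float16)

def w4a16_matmul(x_fp16: np.ndarray, q, scale, zero) -> np.ndarray:
    """Dequantize int4 weight codes to fp16, then multiply with fp16 activations."""
    w_fp16 = (q.astype(np.float16) - zero) * scale
    return x_fp16 @ w_fp16

if __name__ == "__main__":
    w = np.random.randn(64, 32).astype(np.float32)   # full-precision weights
    x = np.random.randn(8, 64).astype(np.float16)    # fp16 activations
    q, scale, zero = quantize_w4(w)
    y = w4a16_matmul(x, q, scale, zero)
    print(y.shape, y.dtype)  # (8, 32) float16
```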