[Major Product News] Neural Magic Announces GPU Support for LLM Inference! Over the past several months, our team has been focused on expanding our capabilities to enable LLM inference on GPUs. A few weeks ago, we published our announcement of nm-vllm, our fork of vLLM, with a focus on incorporating the latest LLM optimizations like… Read More Neural Magic Product Release Update - Q1 2024
Announcing Community Support for GPU Inference Serving Over the past five years, Neural Magic has focused on accelerating inference of deep learning models on CPUs. To achieve this, we did two things: Many of the techniques we used to make CPUs more efficient can also help GPUs process LLMs.… Read More Bringing the Neural Magic to GPUs
For the last several months, we’ve been quite busy building out features across our libraries to enable large language model (LLM) inference on CPUs. We upgraded SparseML to support LLMs and generative models through transformers training, sparsification, and export pipelines. DeepSparse, Neural Magic’s inference runtime, has also been enhanced for performant LLM inference.… Read More Neural Magic 1.6 Product Release
Key Takeaways This has been an exceptionally exciting year for open-source large language models (LLMs). Just 11 months ago, proprietary models like GPT-3 were the only reasonable choice for companies building generative AI applications. Now, there is a thriving ecosystem of high-quality open-source models, like Meta’s Llama family. In February, Meta released the… Read More Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse
The AI space is abuzz with large language models (LLMs), but using them locally is a challenge due to their enormous size. Organizations that want to use these models for applications such as question answering must either invest in expensive cloud infrastructure or use closed-source models. By using closed-source models, companies also give up their… Read More Run a Medical Chatbot on CPUs With Sparse LLMs and DeepSparse
In the burgeoning field of AI, large language models (LLMs) currently dominate the headlines, powering applications that range from writing assistance to conversational AI. The popularity of these models is driven by their ability to generate text that is not only coherent but also contextually relevant. Default LLM inference pipelines operate by choosing the next… Read More Navigating the Nuances of Text Generation: How to Control LLM Outputs With DeepSparse
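The idea behind controlling a text-generation pipeline is that the model produces scores (logits) for every candidate next token, and decoding parameters such as temperature and top-k reshape how a token is picked from those scores. The sketch below is purely illustrative, in plain Python with a toy logits dictionary — it is not DeepSparse's API, just the general sampling mechanics those parameters control.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Pick the next token from a {token: score} logits dict.

    temperature=0 degenerates to greedy decoding (always the argmax);
    top_k keeps only the k highest-scoring tokens before sampling.
    """
    rng = rng or random.Random(0)
    # Sort candidates by score, highest first
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    if temperature == 0:
        return items[0][0]          # greedy: take the best-scoring token
    # Softmax over the (optionally truncated) logits, scaled by temperature
    scaled = [score / temperature for _, score in items]
    m = max(scaled)                 # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token according to its probability mass
    r, acc = rng.random(), 0.0
    for (token, _), p in zip(items, probs):
        acc += p
        if r <= acc:
            return token
    return items[-1][0]

toy_logits = {"cat": 3.2, "dog": 2.9, "car": 0.5}
print(sample_next_token(toy_logits, temperature=0))  # greedy -> "cat"
```

Lower temperatures concentrate probability on the top tokens (more deterministic output), while higher temperatures flatten the distribution (more varied output); top-k simply refuses to sample from the long tail.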
Since OpenAI's introduction of ChatGPT, developers worldwide have widely embraced the OpenAI API as the go-to solution for making API requests to their language models. However, in response to the growing demand within open-source communities for more accessible and cost-effective language model alternatives, users have started to explore the integration of DeepSparse with OpenAI's API.… Read More Integrating DeepSparse With OpenAI’s API for Fast Local LLMs
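The appeal of an OpenAI-compatible integration is that a locally served model can be addressed with the same request shape as OpenAI's hosted API, so existing client code needs only a different base URL. The sketch below builds such a chat-completions request body; the endpoint URL, port, and model name are assumptions for illustration, not values from the post.

```python
import json

# Hypothetical local endpoint: a DeepSparse server exposing an
# OpenAI-compatible route. Host, port, and path are assumptions.
BASE_URL = "http://localhost:5543/v1/chat/completions"

def build_chat_request(model, user_message, max_tokens=128, temperature=0.7):
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

# The same payload an OpenAI client would POST, just aimed at BASE_URL
payload = build_chat_request("mpt-7b-chat", "Summarize sparse inference.")
print(json.dumps(payload, indent=2))
```

Because the request and response schemas match OpenAI's, tooling written against the hosted API (SDKs, prompt frameworks) can typically be pointed at the local server unchanged.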
LangChain is one of the most exciting tools in Generative AI, with many interesting design paradigms for building large language model (LLM) applications. However, developers who use LangChain have to choose between expensive APIs or cumbersome GPUs to power LLMs in their chains. With Neural Magic, developers can accelerate their models on CPU hardware, to… Read More Building Sparse LLM Applications on CPUs With LangChain and DeepSparse
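The "chain" design paradigm the post refers to is a pipeline of composable stages: a prompt template formats the input, an LLM generates text, and a parser cleans the result. The toy sketch below shows that pattern in plain Python — `fake_llm` is a hypothetical stand-in for a DeepSparse-backed model, and none of these names come from LangChain's actual API.

```python
def prompt_template(question):
    """Stage 1: format the user input into a full prompt."""
    return f"Answer concisely.\nQuestion: {question}\nAnswer:"

def fake_llm(prompt):
    """Stage 2: hypothetical model call; a real chain would invoke
    a CPU-accelerated LLM here. We return canned text for demonstration."""
    return " 42" if "life" in prompt else " (model output)"

def output_parser(text):
    """Stage 3: clean up the raw generation."""
    return text.strip()

def run_chain(question):
    # Compose the stages, mirroring how a chain pipes
    # template -> llm -> parser
    return output_parser(fake_llm(prompt_template(question)))

print(run_chain("What is the meaning of life?"))  # -> "42"
```

Swapping the LLM stage — from a hosted API to a locally accelerated model — leaves the rest of the chain untouched, which is what makes the pairing of a chaining framework with a CPU runtime attractive.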
The arrival of capable open-source large language models (LLMs) like MosaicML’s MPT and Meta’s Llama 2 has made it easier for enterprises to explore generative AI to address their business challenges. Yet, adoption of open-source models for commercial applications is still hampered by two key problems: In our recent paper, Sparse Fine-Tuning for Inference Acceleration… Read More Sparse Fine-Tuning for Accelerating Large Language Models with DeepSparse
As artificial intelligence (AI) and machine learning (ML) have become the backbone of technological innovation, companies race to provide the best solutions for businesses to increase optimization, efficiency, and scalability. Our founders launched Neural Magic so customers didn’t have to hit the same roadblocks they encountered when it came to utilizing maximum hardware capabilities for… Read More Optimal CPU AI Inference with AMD EPYC™ 8004 Series Processors and Neural Magic DeepSparse