Building Sparse LLM Applications on CPUs With LangChain and DeepSparse


LangChain is one of the most exciting tools in Generative AI, with many interesting design paradigms for building large language model (LLM) applications. However, developers who use LangChain have to choose between expensive APIs and cumbersome GPUs to power the LLMs in their chains. With Neural Magic, developers can accelerate their models on CPU hardware, which makes it easier to build powerful LangChain applications.

You can achieve remarkable speeds on CPUs by deploying sparse LLMs with DeepSparse, an inference runtime that offers up to 7X faster text generation on CPUs by leveraging the model's sparsity. This means you can run your LLMs wherever there is a CPU: in the cloud, in a data center, or at the edge.

Continue reading to learn how to use LangChain and DeepSparse to deploy LLM applications on CPUs.

How to Use LangChain With DeepSparse

First, confirm your hardware is supported, and then install the required packages:

pip install deepsparse-nightly[llm] langchain
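
After installing, a quick sanity check is to confirm that both packages can be found by the interpreter you plan to run the application with. The sketch below uses only the standard library; `check_packages` is a hypothetical helper name, and it assumes the packages expose the top-level modules `deepsparse` and `langchain`.

```python
import importlib.util


def check_packages(names):
    """Return a mapping of module name -> whether the module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(["deepsparse", "langchain"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If either package is reported as missing, check that pip installed into the same environment as the Python interpreter you are running.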

To use DeepSparse in LangChain, import the DeepSparse class and pass the path to a model. You can use a Hugging Face model or a SparseZoo stub.

from langchain.llms import DeepSparse

prompt = "Who was the first president of the United States of America?"
llm = DeepSparse(model="hf:neuralmagic/mpt-7b-chat-pruned50-quant")
# Run the prompt through the model and print the generated text
print(llm(prompt))
# The first president of the United States of America was George Washington. He was elected in 1789 and served two terms until 1797.

Stream Content

The DeepSparse LangChain integration can stream the LLM's response as it is generated. To enable this, set the streaming argument to True:

from langchain.llms import DeepSparse

llm = DeepSparse(model="hf:neuralmagic/mpt-7b-chat-pruned50-quant", streaming=True)
for chunk in llm.stream("Tell me a joke", stop=["'","\n"]):
    print(chunk, end='', flush=True)
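
To see what consuming a stream with stop sequences involves, here is a small sketch that runs without a model. `fake_stream` is a hypothetical stand-in for a streaming LLM, and `collect_until_stop` is an illustrative helper; neither is part of LangChain or DeepSparse, and the real integration may handle stop sequences differently.

```python
from typing import Iterable, Iterator, List


def fake_stream(text: str, size: int = 4) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields the text in small chunks."""
    for i in range(0, len(text), size):
        yield text[i:i + size]


def collect_until_stop(chunks: Iterable[str], stop: List[str]) -> str:
    """Accumulate streamed chunks and cut the text at the first stop sequence."""
    out = ""
    for chunk in chunks:
        out += chunk
        # Search the whole buffer, because a stop sequence can span two chunks
        hits = [out.find(s) for s in stop if s in out]
        if hits:
            return out[:min(hits)]
    return out


reply = collect_until_stop(
    fake_stream("Why did the CPU cross the road?\nTo get"), stop=["\n"]
)
print(reply)
```

The point of the sketch is that stop handling has to look at the accumulated text rather than at each chunk alone, since chunk boundaries are arbitrary.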

Use LangChain Prompt Templates

The DeepSparse class is a LangChain LLM wrapper, which means you can combine it with the other LangChain utilities, such as prompt templates and output parsers:

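from langchain import PromptTemplate
from langchain.llms import DeepSparse
from langchain.output_parsers import CommaSeparatedListOutputParser

llm_parser = CommaSeparatedListOutputParser()
llm = DeepSparse(model='hf:neuralmagic/mpt-7b-chat-pruned50-quant')

# Template with a single placeholder, {do}
prompt = PromptTemplate(
    template="List how to {do}",
    input_variables=["do"],
)

# Fill in the placeholder and send the formatted prompt to the model
output = llm.predict(text=prompt.format(do="Become a great software engineer"))
print(output)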
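
As a rough picture of what a prompt template and a comma-separated output parser do, the sketch below reproduces both steps in plain Python. `format_prompt` and `parse_comma_separated` are hypothetical stand-ins rather than LangChain's implementation, and `sample_output` is made-up text used only to exercise the parser.

```python
from typing import List


def format_prompt(template: str, **values: str) -> str:
    """Fill the named placeholders in a template string."""
    return template.format(**values)


def parse_comma_separated(text: str) -> List[str]:
    """Split text on commas and trim whitespace, dropping empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


prompt_text = format_prompt("List how to {do}", do="Become a great software engineer")
print(prompt_text)

# Hypothetical model output, not a real DeepSparse response
sample_output = "practice daily, read code, write tests"
print(parse_comma_separated(sample_output))
```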


Chat With PDF Application on a CPU

To see an LLM application running on a CPU, build a PDF chat application using DeepSparse, Chainlit, and LangChain. This application will require the following packages:

pip install deepsparse-nightly[llm] chainlit langchain PyPDF2 sentence_transformers chromadb

Start by importing all the required modules:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
import os
import chainlit as cl
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import DeepSparse
from io import BytesIO
import PyPDF2

Next, set up the embeddings and the model at module level so that they are downloaded once, when the program starts. The model is mosaicml/mpt-7b-chat, pruned to 50% sparsity and then quantized. For the embeddings, use sentence-transformers through Hugging Face:

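text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# No model_name is passed here, so HuggingFaceEmbeddings uses its default model
embeddings = HuggingFaceEmbeddings(
    model_kwargs={"device": "cpu"},
)
# Same pruned and quantized MPT model used in the earlier examples
llm = DeepSparse(
    model="hf:neuralmagic/mpt-7b-chat-pruned50-quant",
    model_config={"sequence_length": 2048},
    stop=["<|im_end|>", "<|endoftext|>"],
)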

Upload your PDF

Chainlit provides tools for building LLM applications quickly.

In this case, when the application starts, the user is prompted to upload a PDF file. Once the file has been uploaded, perform the following tasks before the user can chat with the PDF:

  • Extract the contents of the PDF file.
  • Split the content into smaller pieces since the LLM has a maximum number of tokens it can accept, 2048 for MPT.
  • Create embeddings for the split text.
  • Store the embeddings in a vector store, in this case, Chroma.
  • Create a RetrievalQA chain that will use the Chroma vector store.
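
The splitting step is the easiest one to picture in isolation. The sketch below cuts text into fixed-size character windows that overlap, using the same chunk_size and chunk_overlap values as the application. It is a deliberate simplification: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and line breaks, while the hypothetical `split_with_overlap` helper here ignores them.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Cut text into character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 2500
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, which helps retrieval find it later.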

All this functionality is bundled in a function that is decorated by cl.on_chat_start. The chain created in this function is saved for use in the next function.

Here is the entire function:

@cl.on_chat_start
async def init():
    files = None

    # Wait for the user to upload a file
    while files is None:
        files = await cl.AskFileMessage(
            content="Please upload a pdf file to begin!",
            accept=["application/pdf"],
        ).send()

    file = files[0]

    msg = cl.Message(content=f"Processing `{file.name}`...")
    await msg.send()

    if file.type != "application/pdf":
        raise TypeError("Only PDF files are supported")

    pdf_stream = BytesIO(file.content)
    pdf = PyPDF2.PdfReader(pdf_stream)
    pdf_text = ""
    for page in pdf.pages:
        pdf_text += page.extract_text()

    texts = text_splitter.create_documents([pdf_text])
    for i, text in enumerate(texts):
        text.metadata["source"] = f"{i}-pl"

    # Create a Chroma vector store
    docsearch = Chroma.from_documents(texts, embeddings)
    # Create a chain that uses the Chroma vector store
    chain = RetrievalQA.from_chain_type(
        llm, retriever=docsearch.as_retriever(), return_source_documents=True
    )
    # Save the metadata and texts in the user session
    metadatas = [{"source": f"{i}-pl"} for i in range(len(texts))]
    cl.user_session.set("metadatas", metadatas)
    cl.user_session.set("texts", texts)

    # Let the user know that the system is ready
    msg.content = f"Processing `{file.name}` done. You can now ask questions!"
    await msg.update()
    cl.user_session.set("chain", chain)

When creating the chain, set return_source_documents to True to get the sources for the answers the LLM provides.
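
To make the roles of the vector store and the chain more concrete, here is a toy version of the retrieve-then-answer flow. Everything in it is a stand-in: `embed` is a bag-of-words counter rather than a sentence-transformers model, `retrieve` replaces the vector store, and `fake_llm` simply echoes the first retrieved chunk. It only illustrates the shape of the result, a dictionary with result and source_documents keys.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real app would use a trained model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def answer_with_sources(question: str, chunks: List[str], llm: Callable[[str], str]) -> Dict:
    """Put the retrieved chunks into one prompt and return the answer with its sources."""
    sources = retrieve(question, chunks)
    prompt = "Context:\n" + "\n".join(sources) + f"\n\nQuestion: {question}\nAnswer:"
    return {"result": llm(prompt), "source_documents": sources}


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the LLM: echoes the first context line."""
    return prompt.splitlines()[1]


chunks = [
    "DeepSparse is an inference runtime for CPUs.",
    "Chainlit provides tools for building LLM applications.",
    "Chroma stores embeddings for similarity search.",
]
res = answer_with_sources("What is DeepSparse?", chunks, fake_llm)
print(res["result"])
print(res["source_documents"])
```

Returning the retrieved chunks next to the answer is what lets the application show users where an answer came from.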

Chat with PDF

With all the building blocks in place, you are now ready to start receiving messages from the user. This is done by creating a function that is decorated by cl.on_message. This function accepts a message parameter. The function also grabs the chain created in the previous function via the user's session. Chainlit runs the function decorated by on_message when a user sends a message.

In this case, create an asynchronous callback handler and pass it to the chain when you call it through its acall method. The chain returns a dictionary containing the result and source_documents keys, which you can display on the screen.

Here is the entire function:

@cl.on_message
async def main(message: cl.Message):
    chain = cl.user_session.get("chain")  # type: RetrievalQA
    cb = cl.AsyncLangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["FINAL", "ANSWER"]
    )
    cb.answer_reached = True
    res = await chain.acall(message.content, callbacks=[cb])

    answer = res["result"]
    source_documents = res["source_documents"]
    source_elements = []

    # Get the metadata and texts from the user session
    metadatas = cl.user_session.get("metadatas")
    all_sources = [m["source"] for m in metadatas]
    texts = cl.user_session.get("texts")

    if source_documents:
        found_sources = []

        # Add the sources to the message
        for source_idx, source in enumerate(source_documents):
            # Get the index of the source
            source_name = f"source_{source_idx}"
            found_sources.append(source_name)
            # Create the text element referenced in the message
            source_elements.append(
                cl.Text(content=str(source.page_content).strip(), name=source_name)
            )

        if found_sources:
            answer += f"\nSources: {', '.join(found_sources)}"
        else:
            answer += "\nNo sources found"

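    if cb.has_streamed_final_answer:
        # The answer was already streamed, so update that message in place
        cb.final_stream.content = answer
        cb.final_stream.elements = source_elements
        await cb.final_stream.update()
    else:
        # Nothing was streamed, so send the answer as a new message
        await cl.Message(content=answer, elements=source_elements).send()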
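
The source-labelling logic in this handler can also be pulled out into a plain function, which makes it easy to check without starting Chainlit. The sketch below is one possible way to do that. `format_answer` is a hypothetical helper, and it represents each source as a (name, content) tuple instead of a cl.Text element. Unlike the handler, it also reports "No sources found" when the chain returns no source documents at all.

```python
from typing import Dict, List, Tuple


def format_answer(res: Dict) -> Tuple[str, List[Tuple[str, str]]]:
    """Append source labels to the answer and return it with (name, content) pairs."""
    answer = res["result"]
    elements = []
    for idx, doc in enumerate(res.get("source_documents") or []):
        # Name each source by its position, as the handler does
        elements.append((f"source_{idx}", str(doc).strip()))
    if elements:
        answer += "\nSources: " + ", ".join(name for name, _ in elements)
    else:
        answer += "\nNo sources found"
    return answer, elements


text, elements = format_answer(
    {"result": "George Washington", "source_documents": [" chunk one ", "chunk two"]}
)
print(text)
print(elements)
```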

Final Thoughts

Deploying large language models on CPUs is now a reality. As you have read, you can use DeepSparse and LangChain to build applications on LLMs that have been optimized to run fast on a CPU. These models enable anyone with a simple computer to start using LLMs. They also make it faster for developers to iterate on solutions without configuring complicated hardware.

Neural Magic is excited to see what kind of LLM applications you will build on CPUs. Join us on Slack for any questions or submit an issue on GitHub. Check out the PDF chat demo on Hugging Face Spaces.