Integrating DeepSparse With OpenAI’s API for Fast Local LLMs

Nov 02, 2023

Author(s)

Michael Goin

Engineering Lead, Neural Magic

Since OpenAI introduced ChatGPT, developers worldwide have adopted the OpenAI API as the go-to way to make requests to language models. At the same time, open-source communities increasingly want more accessible and cost-effective alternatives, and users have started to explore integrating DeepSparse with OpenAI's API. This integration lets you speed up large language model (LLM) inference on CPUs without modifying your existing OpenAI code. It was developed to follow OpenAI's API specification, so switching to local inference takes little effort. With it, you can deploy LLMs on your own terms: on your local machine or on-premises, using readily available consumer-grade CPUs.

To illustrate the practicality of this integration, we will walk through a step-by-step example: setting up a DeepSparse Server with the OpenAI integration and connecting it to LinkedIn's API. The result is a script that programmatically creates LinkedIn posts from language model output, demonstrating what this solution can bring to your content creation process.

How to Use OpenAI API With DeepSparse

To run Neural Magic's LLMs via the OpenAI API, use the DeepSparse Server CLI command with the '--integration openai' flag:

deepsparse.server task text_generation  \
--model_path "hf:mgoin/llama2.c-stories15M-ds" \
--integration openai

This integration exposes OpenAI-compatible endpoints, such as '/v1/models' and '/v1/chat/completions'. You can send inference requests with standard tools like 'curl' or 'requests', or with OpenAI's Python library. The sections that follow show requests made both in Python and with 'curl'.

PRO TIP: Alternatively, you can start the DeepSparse Server with the deepsparse.openai CLI command, passing it a sample config file that contains the server configuration:

deepsparse.openai sample_config.yaml
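Once the Server is running, the '/v1/models' endpoint mentioned above is a quick way to confirm that it is reachable. The following is a minimal sketch using only the Python standard library. It assumes the Server listens at http://localhost:5543/v1 (the route used in the Python examples in this post) and that the response follows OpenAI's list format, with model entries under a "data" key; the helper names are hypothetical and not part of DeepSparse.

```python
import json
import urllib.request

# Assumed server route; adjust if your Server runs elsewhere.
BASE_URL = "http://localhost:5543/v1"


def endpoint(path, base_url=BASE_URL):
    """Join the base route and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def model_ids(payload):
    """Pull model IDs out of an OpenAI-style list response (assumed shape)."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url=BASE_URL, timeout=10):
    """GET the models endpoint of a running server and return the model IDs."""
    with urllib.request.urlopen(endpoint("models", base_url), timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

With the Server started as shown earlier, calling list_models() should return a list of model names if the response has the assumed shape; a connection error usually means the Server is not running or is listening on a different port.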

Making Requests to the Server Using Python or CLI

Once the Server is running, call it with OpenAI's Python API just as you would call ChatGPT and other OpenAI models. No OpenAI API key is required, so a placeholder value works. Set the api_base configuration parameter to the DeepSparse Server's route, http://localhost:5543/v1.

The example below configures the OpenAI library to send its requests to the running Server, then makes a request to the Llama 2 model trained on the Tiny Stories dataset. It uses the pre-1.0 interface of the openai Python package (openai.ChatCompletion and openai.api_base); releases 1.0 and later replaced this interface with a client object.

import openai

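# No OpenAI API key is required by the local Server, so a placeholder is used.
openai.api_key = "EMPTY"
# Point the library at the DeepSparse Server instead of OpenAI's hosted API.
openai.api_base = "http://localhost:5543/v1"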

completion = openai.ChatCompletion.create(
    messages="Once upon a time",
    stream=False,
    max_tokens=30,
    model="hf:mgoin/llama2.c-stories15M-ds",
)

print("Chat results:")
print(completion)

Here is the output in OpenAI format:

{
  "id": "cmpl-1b76ca9192ad40c39bce094c74937313",
  "object": "chat.completion",
  "created": 1698173139,
  "model": "hf:mgoin/llama2.c-stories15M-ds",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "He was very hungry and wanted to eat something yummy. He saw a big, red apple on the ground and he was so excited. He picked it up and took a big bite. It was so delicious!\nSuddenly, a big, mean dog came running towards him. Michael was scared and he started to cry. The dog barked and growled at Michael. He was so scared that he dropped the apple!"
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 2,
    "total_tokens": 4,
    "completion_tokens": 2
  }
}
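In practice you usually want only the generated text, not the whole response object. The sketch below pulls the text and the finish reason out of a response with the shape shown above; the function name is hypothetical. In OpenAI's format, a "finish_reason" of "length" indicates that generation stopped because the token limit was reached, so the text may end mid-thought.

```python
def summarize_completion(response):
    """Return (text, finish_reason, truncated) for the first choice."""
    choice = response["choices"][0]
    text = choice["message"]["content"]
    finish_reason = choice.get("finish_reason")
    # "length" means the token limit ended generation early.
    return text, finish_reason, finish_reason == "length"


# A trimmed-down response in the same format as the output above.
example = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "He was very hungry."},
            "finish_reason": "length",
        }
    ]
}

text, reason, truncated = summarize_completion(example)
print(text)
print(reason, truncated)
```

The function uses dictionary-style access, so it works on the parsed JSON from any HTTP client.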

As an alternative to Python, you can send a request from the command line with cURL:

curl http://localhost:5543/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "hf:mgoin/llama2.c-stories15M-ds",
        "messages": "Once upon a time",
        "max_tokens": 30,
        "n": 2,
        "stream": false
    }'
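For scripts that should not depend on the openai package, the same request can be sent with Python's standard library. This is a sketch mirroring the cURL call above (same URL, header, and JSON body); the function names are hypothetical, and it assumes the response lists one entry per requested completion under "choices", as in the output format shown earlier.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:5543/v1/chat/completions"


def build_chat_request(prompt, model, max_tokens=30, n=2, url=CHAT_URL):
    """Build a POST request carrying the same JSON body as the cURL example."""
    body = {
        "model": model,
        "messages": prompt,
        "max_tokens": max_tokens,
        "n": n,
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the generated text of every choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return [choice["message"]["content"] for choice in payload["choices"]]
```

With the Server running, send_chat_request(build_chat_request("Once upon a time", "hf:mgoin/llama2.c-stories15M-ds")) would return the generated texts as a list, one per choice, under the assumptions above.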

Using the DeepSparse OpenAI Server to Submit LinkedIn Posts

Put the DeepSparse OpenAI integration to the test by using the Server to generate LinkedIn posts with a Llama 2 model trained on Tiny Stories. First, install the OpenAI and DeepSparse LLM packages:

pip install openai "deepsparse-nightly[server,llm]"

Next, start the Server:

deepsparse.server task text_generation  \
--model_path "hf:mgoin/llama2.c-stories15M-ds" \
--integration openai

To make requests to the LinkedIn API, you will need two API tokens: the LinkedIn access token, which authorizes submitting posts, and your profile token, which links your LinkedIn profile to the API. To obtain both tokens, create an application on LinkedIn's developer site.
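Because both tokens are credentials, one option is to read them from environment variables rather than pasting them into the script. The sketch below is one way to do that; the variable names LINKEDIN_ACCESS_TOKEN and LINKEDIN_PROFILE_TOKEN and the function name are hypothetical choices for illustration, not names defined by LinkedIn or DeepSparse.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
TOKEN_NAMES = ("LINKEDIN_ACCESS_TOKEN", "LINKEDIN_PROFILE_TOKEN")


def load_linkedin_tokens(env=os.environ):
    """Return (access_token, profile_token); fail early if either is missing."""
    missing = [name for name in TOKEN_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[TOKEN_NAMES[0]], env[TOKEN_NAMES[1]]
```

The returned pair could then be assigned to the ACCESS_TOKEN and PROFILE_TOKEN constants used in the script, which keeps the tokens out of version control.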

The script includes a function for calling the Server:

call_server: Takes a prompt, points the OpenAI library at the local DeepSparse Server, requests a chat completion with a fixed model name, maximum token count, and sampling settings, and returns the completion.

And a Linkedin class for preparing the LLM-generated content for posting:

params: This method wraps the LLM-generated text in the LinkedIn API payload and builds the request headers, including the LinkedIn access token for authentication. The main function then publishes the post via the requests library.

import openai
import requests

ACCESS_TOKEN = 'YOUR-TOKEN'
PROFILE_TOKEN = 'YOUR-TOKEN'

def call_server(prompt):
    
    openai.api_key = "EMPTY"
    openai.api_base = "http://localhost:5543/v1"

    completion = openai.ChatCompletion.create(
        messages=prompt,
        stream=False,
        max_tokens=100,
        model="hf:mgoin/llama2.c-stories15M-ds",
        n=2,
        do_sample=True,
    )

    return completion

class Linkedin:
    # LinkedIn identifies the post author by a person URN.
    profile_id = f'urn:li:person:{PROFILE_TOKEN}'
    url = 'https://api.linkedin.com/v2/ugcPosts'

    def params(self, text):
        """Build the JSON payload and headers for a public, text-only post."""
        payload = {
            "author": self.profile_id,
            "lifecycleState": "PUBLISHED",
            "specificContent": {
                "com.linkedin.ugc.ShareContent": {
                    "shareCommentary": {
                        "text": text
                    },
                    "shareMediaCategory": "NONE",
                },
            },
            "visibility": {
                "com.linkedin.ugc.MemberNetworkVisibility": "PUBLIC"
            }
        }

        headers = {
            'Content-Type': 'application/json',
            'X-Restli-Protocol-Version': '2.0.0',
            'Authorization': 'Bearer ' + ACCESS_TOKEN
        }

        return payload, headers
    
def main():
    prompt = "Once upon a time"

    # Keep generating drafts until the user approves one.
    while True:
        output = call_server(prompt)
        text = prompt + " " + output.choices[0].message.content
        print(text)
        if input("\nShare post? (y/n): ").strip().lower() == 'y':
            break

    # Wrap the approved text in the LinkedIn payload and publish it.
    payload, headers = Linkedin().params(text)
    print("sending")
    return requests.post(Linkedin.url, headers=headers, json=payload)

if __name__ == "__main__":
    main()
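The script returns the response from LinkedIn but does not inspect it. A small helper like the one below can report whether the post went through. It is a sketch resting on two assumptions drawn from LinkedIn's documentation rather than from this post: that a successful share returns HTTP 201, and that the new post's identifier arrives in the X-RestLi-Id response header. The function name is hypothetical.

```python
def describe_post_result(status_code, headers):
    """Summarize a LinkedIn response given its status code and headers."""
    if status_code == 201:  # assumed success status for a created post
        post_id = headers.get("X-RestLi-Id", "unknown")
        return f"Post created (id: {post_id})"
    return f"Post failed with HTTP {status_code}"


# Illustrative values only; real ones come from the requests.post response.
print(describe_post_result(201, {"X-RestLi-Id": "urn:li:share:123"}))
print(describe_post_result(401, {}))
```

In the script, this could be called with response.status_code and response.headers on the value returned by requests.post.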


Conclusion

In this blog post, we covered the integration of DeepSparse Server with the OpenAI API, which lets you run Sparse LLMs efficiently on local machines with minimal code changes. As the OpenAI API has become a standard choice for interacting with language models like ChatGPT, the demand for open-source LLM alternatives has grown within the developer community.

This integration lets the DeepSparse Server slot into existing OpenAI API deployments, enabling efficient LLM inference on CPUs for machine learning applications. We outlined the integration process, showed how to start the DeepSparse Server, and demonstrated how to make API requests via OpenAI's Python API. We also showcased an educational use case: programmatically generating LinkedIn posts with a Sparse Llama 2 model trained on Tiny Stories.

This integration enhances machine learning and content creation workflows while tapping into the potential of Sparse LLMs for a variety of applications. To keep up with our mission of efficient software-delivered AI, join us in the Neural Magic Community Slack for any questions and subscribe to our monthly newsletter below.
