Accelerate Hugging Face Inference Endpoints with DeepSparse


A well-known burden with today’s cloud providers is the growing need for technical experts to manage infrastructure. The unfortunate consequence is long deployment times, which slow the iteration rate on your model’s journey to production. Hugging Face 🤗 Inference Endpoints is a new service that automates the deployment of deep learning inference pipelines on public cloud providers such as AWS, Azure, and GCP. You can deploy a model directly from a Hugging Face model repository, choose the cloud provider’s instance type, and generate an inference endpoint in minutes without managing the backend infrastructure the public cloud usually requires.

In this blog, we’ll outline how to quickly deploy a sentiment analysis pipeline as a Hugging Face Inference Endpoint and benchmark the performance of a sparse DistilBERT running in DeepSparse against its dense PyTorch variant across CPU and GPU instance types.

Keep in mind, you need to set up a billing account to use Hugging Face’s Inference Endpoints.


Building and invoking your endpoint requires the following steps:

  1. Installing Git LFS on your local machine.
  2. Downloading a Sparse transformer and selecting deployment files from the SparseZoo.
  3. Creating a custom file for handling inferencing on the Hugging Face platform.
  4. Pushing the deployment files up to the Models Hub.
  5. Picking your instance on the Hugging Face Endpoints UI Platform.
  6. Calling the endpoint by cURL or Python.

PRO TIP: We’ve already created an example model repository on the Hugging Face Models Hub holding all the required files if you wish to skip the step-by-step approach below. Don’t forget to give the repository a “❤️ like”!

Step 1: Installing Git LFS

Install Git LFS on your local machine to communicate with the Models Hub from your development environment:

sudo apt-get update
sudo apt-get install git-lfs

Step 2: Downloading Deployment Files from the SparseZoo

The next step is to create a model repository on the Models Hub and clone it to your local machine:

git clone https://huggingface.co/<user>/<model-name>
cd <model-name>

Now make an API call to Neural Magic’s SparseZoo to download the deployment directory of the Sparse Transfer 80% VNNI Pruned DistilBERT model into your development environment. This model is the result of pruning DistilBERT to 80% sparsity using VNNI blocking (semi-structured pruning), followed by fine-tuning and quantization on the SST2 dataset.

Start by installing SparseZoo:

pip install sparsezoo

To download the model’s weights and configuration files, run the following Python script, which fetches the deployment directory from the SparseZoo. Be sure to set your download path in the script:

from sparsezoo import Model

stub = "zoo:nlp/sentiment_analysis/distilbert-none/pytorch/huggingface/sst2/pruned80_quant-none-vnni"
model = Model(stub, download_path="<your local path>")

# Downloads and prints the download path of the model
print(model.deployment.path)

After the download is complete, the deployment directory should appear in your environment with four files. Copy these files into your current path (the root of your cloned repository):

  • ONNX file - model.onnx
  • model config - config.json
  • tokenizer - tokenizer.json
  • tokenizer config file - tokenizer_config.json 

Then delete the deployment directory.
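
The copy-and-delete step above can also be scripted. The following is a minimal sketch, assuming the SparseZoo download produced a directory holding the four files listed above. The helper name flatten_deployment_dir and the directory name "deployment" are illustrative and are not part of SparseZoo.

```python
import shutil
from pathlib import Path

# File names taken from the list above.
DEPLOYMENT_FILES = (
    "model.onnx",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
)


def flatten_deployment_dir(deployment_dir, repo_dir):
    """Copy the deployment files into repo_dir, then delete deployment_dir.

    Hypothetical helper: raises FileNotFoundError before changing anything
    if one of the expected files is missing.
    """
    deployment_dir, repo_dir = Path(deployment_dir), Path(repo_dir)
    missing = [n for n in DEPLOYMENT_FILES if not (deployment_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing deployment files: {missing}")

    copied = []
    for name in DEPLOYMENT_FILES:
        shutil.copy2(deployment_dir / name, repo_dir / name)
        copied.append(repo_dir / name)

    # Remove the now-redundant deployment directory.
    shutil.rmtree(deployment_dir)
    return copied
```

From the root of the cloned repository, a call such as flatten_deployment_dir("deployment", ".") would leave the four files next to each other, ready for the handler added in the next step.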

Step 3: Adding DeepSparse Pipeline to the Endpoint Handler

For Inference Endpoints to make a prediction for a custom deployment (i.e., not PyTorch or TensorFlow), a handler.py file is required. This file needs to include an EndpointHandler class with __init__ and __call__ methods. Additionally, a requirements.txt file is required for adding any custom dependencies you need at runtime. For our example, requirements.txt only needs to list the deepsparse library.


Now build handler.py with an EndpointHandler class and add it to the current path. The __init__ constructor creates the DeepSparse Pipeline, and the __call__ method calls the pipeline and uses perf_counter to measure latency. This is how it looks:

from typing import Any, Dict
from time import perf_counter

from deepsparse import Pipeline


class EndpointHandler:

    def __init__(self, path=""):
        # Assumed arguments: the sentiment analysis task, with the model files
        # loaded from the repository path passed in by Inference Endpoints.
        self.pipeline = Pipeline.create(
            task="sentiment-analysis",
            model_path=path,
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Args:
            data (:obj:): prediction input text
        """
        inputs = data.pop("inputs", data)

        # Time only the pipeline call
        start = perf_counter()
        prediction = self.pipeline(inputs)
        end = perf_counter()
        latency = end - start

        return {
            "labels": prediction.labels,
            "scores": prediction.scores,
            "latency (secs.)": latency,
        }

You now have all the files required to set up your endpoint: handler.py, requirements.txt, and the deployment directory files (model.onnx, config.json, tokenizer.json, and tokenizer_config.json).
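
Before pushing, it may be worth confirming that nothing is missing from the repository. Here is a small sketch of such a check, assuming the custom handler is saved as handler.py (the file name Inference Endpoints looks for in a custom deployment). The helper name missing_repo_files is illustrative.

```python
from pathlib import Path

# Files named in this walkthrough; handler.py is the custom handler file.
REQUIRED_FILES = (
    "handler.py",
    "requirements.txt",
    "model.onnx",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_repo_files(repo_dir="."):
    """Return the required file names that are absent from repo_dir."""
    repo_dir = Path(repo_dir)
    return [name for name in REQUIRED_FILES if not (repo_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_repo_files(".")
    print("ready to push" if not missing else f"missing: {missing}")
```

Running the script from the repository root prints either a confirmation or the list of files still to be added.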

Step 4: Pushing the Deployment Files Up to the Models Hub

To push up the files to your remote repository in the Models Hub, set up a User Access Token from your Settings page. The User Access Token is used to authenticate your identity to the Hub.

Install the Hub client with the CLI dependency:

pip install 'huggingface_hub[cli]'

Once you have your User Access Token, run the following commands in your terminal:

huggingface-cli login
git config --global credential.helper store

Now you can run the following Git commands to add, commit, and push your files:

git add -A
git commit -m "push files"
git push

Step 5: Picking Your Instance on the Hugging Face Endpoints UI Platform

You now have all the files required to get the endpoint up and running, so set your endpoint configuration on the Endpoints platform. Select an AWS instance with two vCPUs and 4GB of RAM in the us-east-1 region, since DeepSparse is built to deliver GPU-class speed on CPUs.

Follow this video to learn about staging an endpoint from your model repository and using the advanced configuration in the Endpoints UI for selecting hardware or scaling up replicas:

After your endpoint configuration is staged successfully, you’ll see the green Running instance logo and your endpoint URL will be displayed. You are now ready to start inferencing!

Endpoint configuration on the Endpoints platform

Step 6: Calling the Endpoint by cURL or Python

You have two options for making an inference invocation: using the cURL command or using a Python script.

The cURL command accepts your endpoint URL, your inputs as the data, and the bearer token (required for private endpoints). To get the bearer token, select Copy as cURL at the bottom of the endpoint page UI (displayed above) to obtain the following cURL command:

curl "<ENDPOINT_URL>" \
-d '{"inputs": "I like pizza."}' \
-H "Authorization: Bearer nKXRoCDkVPMyLGnmKGfFcDTJGvDvUUsuDMzvuMkfdQqrvrrJJZQausaaplNcWkuyTIwKayAdFaHHAtWyOTfLBwpdDuFCEFXKBTHTJTyQYaSje" \
-H "Content-Type: application/json"

Paste this in a local terminal and the cURL command will call the handler for a prediction of your input’s sentiment and latency. The output should look something like this:

["{\"labels\": [\"positive\"], \"scores\": [0.9992029666900635]}","latency: 0.0036253420112188905 secs."]

To make an inference invocation via Python, you can use the requests library. Similar to the cURL command, you’ll add to the script the endpoint URL, the bearer token, and sample input text:

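import requests as r

# Placeholders: fill in your endpoint URL and bearer token
ENDPOINT_URL = "<ENDPOINT_URL>"
HF_TOKEN = "<TOKEN>"

payload = {"inputs": "I like pizza."}

headers = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type": "application/json",
}

response = r.post(ENDPOINT_URL, headers=headers, json=payload)
prediction = response.json()

print(prediction)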


The expected output in the response is:

["{"labels": ["positive"], "scores": [0.9992029666900635]}","latency: 0.0036253420112188905 secs."]
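
If you want to work with the result programmatically, the response can be unpacked into plain Python values. The sketch below assumes the endpoint returns the two-element list shown above (a JSON-encoded prediction string followed by a latency string). The function name parse_endpoint_response is illustrative, and a handler that returns a dictionary directly would not need this step.

```python
import json
import re


def parse_endpoint_response(items):
    """Split the two-element response into labels, scores, and latency.

    Assumes the shape shown above: a JSON-encoded prediction string followed
    by a string of the form "latency: <float> secs.".
    """
    prediction = json.loads(items[0])
    match = re.search(r"latency:\s*([0-9.eE+-]+)\s*secs", items[1])
    if match is None:
        raise ValueError(f"unexpected latency string: {items[1]!r}")
    return {
        "labels": prediction["labels"],
        "scores": prediction["scores"],
        "latency_secs": float(match.group(1)),
    }


# Sample body copied from the cURL output shown earlier.
raw = r'["{\"labels\": [\"positive\"], \"scores\": [0.9992029666900635]}","latency: 0.0036253420112188905 secs."]'
parsed = parse_endpoint_response(json.loads(raw))
print(parsed["labels"], parsed["scores"], parsed["latency_secs"])
```

With the requests script, the same function can be applied to the value returned by response.json().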


Let's benchmark the sparse DistilBERT running in DeepSparse against its dense PyTorch variant to compare performance on a medium instance with two vCPUs and 4GB of RAM and on a single NVIDIA T4 GPU, both in the us-east-1 region of AWS.

In the table below, note the dramatic improvement in average latency with only two vCPUs! DeepSparse fully utilizes the available CPU hardware to achieve an average latency 43X lower than dense PyTorch on the same instance!

DeepSparse Cost Efficiency on Inference Endpoint

It’s even faster than a T4 GPU while being 5X cheaper per month, resulting in an 87% reduction in cost per million inferences!

batch size=1, seq_length=128

Version                  Hardware                      Avg. Latency (ms)   Cost / Month   Cost / Million Inferences
Dense PyTorch CPU        2 vCPUs 4GB Intel Ice Lake    406.72              $87.61         $13.56
Dense PyTorch GPU        1x NVIDIA T4 GPU              14.30               $438.05        $2.38
Sparse DeepSparse CPU    2 vCPUs 4GB Intel Ice Lake    9.44                $87.61         $0.31
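
The cost column can be reproduced from the latency and the monthly price. The sketch below assumes a 730-hour month and an endpoint that serves requests back to back at the average latency. Both are assumptions made here for illustration rather than a stated benchmark methodology, although they do reproduce the figures in the table.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used for pricing


def cost_per_million(monthly_cost_usd, avg_latency_ms):
    """Cost of one million sequential inferences at the given average latency."""
    ms_per_month = HOURS_PER_MONTH * 3600 * 1000
    inferences_per_month = ms_per_month / avg_latency_ms
    return monthly_cost_usd / inferences_per_month * 1_000_000


# (monthly cost in USD, average latency in ms) taken from the table above
rows = {
    "Dense PyTorch CPU": (87.61, 406.72),
    "Dense PyTorch GPU": (438.05, 14.30),
    "Sparse DeepSparse CPU": (87.61, 9.44),
}
costs = {name: round(cost_per_million(*vals), 2) for name, vals in rows.items()}
print(costs)  # 13.56, 2.38, and 0.31 respectively

speedup_vs_dense_cpu = 406.72 / 9.44
reduction_vs_gpu = 1 - costs["Sparse DeepSparse CPU"] / costs["Dense PyTorch GPU"]
print(round(speedup_vs_dense_cpu), f"{reduction_vs_gpu:.0%}")  # 43 87%
```

The same function can be used to estimate cost for other instance prices or latencies, keeping in mind that real traffic rarely keeps an endpoint fully busy.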

Final Thoughts

We showed how easy it is to set up an HTTP endpoint using the Hugging Face Inference Endpoints platform with DeepSparse. Additionally, we benchmarked the inference performance of Neural Magic’s DistilBERT against its dense PyTorch variant to highlight the dramatic performance improvement and cost reduction when using DeepSparse and a bit of Neural Magic. 

For more on Neural Magic’s open-source codebase, view the GitHub repositories (DeepSparse and SparseML). For Neural Magic Support, sign up or log in to get help with your questions in our community Slack. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue.