
Note Of HuggingFace Text Generation Inference (TGI)

Last Updated on 2024-07-31 by Clay

Introduction

Hugging Face's Text Generation Inference (TGI) is a framework designed specifically for deploying and accelerating LLM inference services. Below is its architecture diagram:

At the outermost layer, the API service exposes the <url>/generate path. Incoming requests first enter a buffer, and the Batcher then groups them and forwards them to the backend model service. According to the official documentation, TGI uses the PagedAttention implementation from vLLM, with its core referencing vLLM directly.

For more on vLLM's accelerated inference mechanism, refer to:


Usage

The recommended way to get started is with the official pre-packaged Docker image, which ensures the correct environment configuration.

# $model is the model path as seen inside the container (it lives under the
# mounted /data/ directory), while $volume is the host directory holding the weights.
model=/data/teknium--OpenHermes-2.5-Mistral-7B/
volume=$PWD/data/


# Run on GPU 1, expose the service on host port 8080, and mount the model directory.
docker run \
    --gpus device=1 \
    --shm-size 10g \
    -p 8080:80 \
    -v $volume:/data/ \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id $model

One thing that might not be clear from the official example is that $model must be the path inside the container (i.e., under the mounted /data/ directory), not the host path. It makes sense once you think about it, but it can be confusing at first.

Once our TGI service is up and running, we can use the curl command to send requests directly.


Sending Requests with curl

curl 127.0.0.1:8080/generate \
   -X POST \
   -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
   -H 'Content-Type: application/json'


Output:

{"generated_text":"\n\nDeep learning is a subset of machine learning in artificial intelligence (AI) that has networks called"}


Sending Requests with Python

Of course, we can also use any language we are familiar with to send requests, such as Python:

import httpx

url = "http://127.0.0.1:8080/generate"
headers = {
    "Content-Type": "application/json"
}
data = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 20
    }
}

response = httpx.post(url, json=data, headers=headers)

# Check response
if response.status_code == 200:
    print("Response received successfully:")
    print(response.json())
else:
    print(f"Failed to get a response. Status code: {response.status_code}")
    print(response.text)


Output:

Response received successfully:
{'generated_text': '\n\nDeep learning is a subset of machine learning in artificial intelligence (AI) that has networks called'}
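
TGI also exposes a /generate_stream endpoint that returns the generation token by token as server-sent events. Below is a minimal sketch using httpx to consume that stream; the request payload mirrors the one above, while the event parsing is kept deliberately simple and may need adjusting for your TGI version.

import json

import httpx

url = "http://127.0.0.1:8080/generate_stream"
headers = {
    "Content-Type": "application/json"
}
data = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 20
    }
}

# Each event line looks like: data:{"token": {...}, "generated_text": ..., ...}
with httpx.stream("POST", url, json=data, headers=headers) as response:
    for line in response.iter_lines():
        if not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):])
        print(payload["token"]["text"], end="", flush=True)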


Using InferenceClient from huggingface_hub to Send Requests

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")
client.text_generation(prompt="How are you today?")


Output:

'\n\nI’m doing well, thank you.\n\nWhat’s your name?\n'
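
The text_generation method also accepts generation parameters such as max_new_tokens, temperature, and top_p, so sampling can be controlled from the client side. A small sketch with purely illustrative values:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

# Illustrative sampling settings; adjust to taste.
output = client.text_generation(
    prompt="What is Deep Learning?",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(output)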


For Streaming Generation...

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")
for token in client.text_generation("How are you today?", max_new_tokens=50, stream=True): 
    print(token, end="")


Output:

I’m doing well, thank you.

What’s your name?

My name is John.

Where are you from?

I’m from the United States.

What do you do?
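
If we additionally pass details=True while streaming, each yielded item is a structured object carrying token-level information (text, id, log-probability) instead of a plain string. A minimal sketch, assuming the same local service as above:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

# With details=True the stream yields objects with a .token field instead of raw strings.
for chunk in client.text_generation(
    "How are you today?", max_new_tokens=20, stream=True, details=True
):
    token = chunk.token
    print(f"text={token.text!r} id={token.id} logprob={token.logprob}")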

Conclusion

Before using TGI, I was firmly in the vLLM camp. However, possibly due to differences in sampling, TGI's outputs often match the format my fine-tuned models expect better.

In summary, TGI is now a candidate for my accelerated inference framework: both starting the service and sending requests are intuitive and easy. It should also be well suited for testing new models (provided the TGI image is updated quickly enough to support new model architectures).

