Last Updated on 2024-07-31 by Clay
Introduction
HuggingFace's Text Generation Inference (TGI) is a framework specifically designed to deploy and accelerate LLM inference services. Below is its architecture diagram:
On the far left of the diagram, at the API service layer, the exposed route is /generate. Incoming requests first enter a buffer and are then handed to the backend model service by the Batcher (a conceptual sketch of this flow follows the links below). According to the official documentation, TGI uses the PagedAttention implemented by vLLM, referencing vLLM's core directly.
For more on vLLM's accelerated inference mechanism, refer to:
- Using vLLM for High-Speed Inference of Large Language Models (LLM)
- Using vLLM To Accelerate Inference Speed By Continuous Batching
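To make the buffer-and-Batcher flow described above more concrete, here is a deliberately simplified sketch. It is not TGI's actual code (TGI's router is written in Rust); the queue, batch size, and polling interval below are purely illustrative assumptions.

import queue
import time

# Toy illustration of the buffer + Batcher idea, NOT TGI's implementation.
request_buffer: "queue.Queue[str]" = queue.Queue()

def handle_generate_request(prompt: str) -> None:
    # The API layer (the /generate route) does not call the model directly;
    # it only drops the request into the buffer.
    request_buffer.put(prompt)

def batcher_loop(run_model_batch, max_batch_size: int = 8, poll_interval: float = 0.01) -> None:
    # The Batcher drains whatever is waiting (up to max_batch_size) and
    # forwards it to the backend model service as a single batch.
    while True:
        batch = []
        while not request_buffer.empty() and len(batch) < max_batch_size:
            batch.append(request_buffer.get())
        if batch:
            run_model_batch(batch)
        time.sleep(poll_interval)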
Usage
The recommended way to get started is with the official pre-packaged Docker image, which ensures the environment is configured correctly.
# Model path as seen inside the container (under the mounted /data/ volume)
model=/data/teknium--OpenHermes-2.5-Mistral-7B/
# Host directory that will be mounted into the container at /data/
volume=$PWD/data/

docker run \
    --gpus device=1 \
    --shm-size 10g \
    -p 8080:80 \
    -v $volume:/data/ \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id $model
One thing that might not be clear from the official example is that $model must be the path inside the container (i.e., under the mounted /data/ directory), not on the host. It makes sense once you think about it, but it can be confusing at first.
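Once the container is up, a quick sanity check is to query TGI's /info route, which reports the loaded model and a few server settings. A minimal sketch, assuming the port mapping above (the exact response fields can vary between TGI versions):

import httpx

# Assumes the container above is running and published on local port 8080.
info = httpx.get("http://127.0.0.1:8080/info", timeout=10.0)
info.raise_for_status()
print(info.json())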
With the TGI service up and running, we can send generation requests directly using the curl command.
Sending Requests with curl
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
Output:
{"generated_text":"\n\nDeep learning is a subset of machine learning in artificial intelligence (AI) that has networks called"}
Sending Requests with Python
Of course, we can also use any language we are familiar with to send requests, such as Python:
import httpx

url = "http://127.0.0.1:8080/generate"
headers = {
    "Content-Type": "application/json"
}
data = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 20
    }
}

response = httpx.post(url, json=data, headers=headers)

# Check response
if response.status_code == 200:
    print("Response received successfully:")
    print(response.json())
else:
    print(f"Failed to get a response. Status code: {response.status_code}")
    print(response.text)
Output:
Response received successfully:
{'generated_text': '\n\nDeep learning is a subset of machine learning in artificial intelligence (AI) that has networks called'}
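The parameters object accepts more than max_new_tokens. Below is a minimal sketch with a few common sampling options from TGI's /generate schema (the specific values are only illustrative):

import httpx

url = "http://127.0.0.1:8080/generate"
data = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 64,
        "do_sample": True,         # sample instead of greedy decoding
        "temperature": 0.7,        # soften the token distribution
        "top_p": 0.9,              # nucleus sampling
        "repetition_penalty": 1.1  # discourage verbatim repetition
    }
}

response = httpx.post(url, json=data, timeout=60.0)
response.raise_for_status()
print(response.json()["generated_text"])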
Using InferenceClient from huggingface_hub to Send Requests
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://127.0.0.1:8080")
client.text_generation(prompt="How are you today?")
Output:
'\n\nI’m doing well, thank you.\n\nWhat’s your name?\n'
For streaming generation, pass stream=True:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://127.0.0.1:8080")
for token in client.text_generation("How are you today?", max_new_tokens=50, stream=True):
    print(token, end="")
Output:
I’m doing well, thank you.
What’s your name?
My name is John.
Where are you from?
I’m from the United States.
What do you do?
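The same generation parameters can also be passed to text_generation as keyword arguments. Here is a small sketch (argument names such as temperature and stop_sequences follow huggingface_hub's signature; the values themselves are just illustrative assumptions):

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

# Sampling options mirror the /generate parameters; stop_sequences ends
# generation as soon as one of the given strings is produced.
answer = client.text_generation(
    prompt="How are you today?",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    stop_sequences=["\n\n"],
)
print(answer)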
Conclusion
Before using TGI, I had always been a vLLM proponent. However, perhaps because of differences in default sampling settings, TGI's output format often fits my fine-tuning needs better.
In summary, TGI is now a candidate for my accelerated-inference framework of choice: starting the service and sending requests are both intuitive and easy. It should also be well suited to testing new models (provided the TGI image is updated quickly enough to support new model architectures).