Last Updated on 2024-07-31 by Clay
Introduction
I previously wrote a note introducing the vLLM accelerated inference framework (Using vLLM To Accelerate The Decoding Of Large Language Model), but due to space and time constraints, I couldn't cover its more detailed features there.
Beyond serving as an accelerated inference framework for LLM research, vLLM also implements a more powerful feature: the Continuous Batching inference technique (also known as Rolling Batch or Continual Batching; the names are used interchangeably).
While this technique may not make a noticeable difference when you are simply evaluating an LLM's output quality on your own, it matters a great deal in production deployments: given sufficient GPU memory, it provides a very smooth user experience, with generation results coming back almost immediately instead of queuing behind other requests.
First, the inference we usually perform with the transformers package essentially uses a static batching mode. Below is a reproduction of a widely circulated diagram.
Suppose we are generating four different sentences. In static batching mode, the model generates the next token for all sentences in the batch at once. We need to wait until T8 to get the complete results for all sentences.
The empty parts in the diagram represent the wasted computation time.
However, if at every decoding step we check whether new generation requests have arrived, then as soon as a sequence finishes, its slot in the batch can be reassigned to a newly arrived request while the remaining sequences continue decoding (as shown by S5, S6, and S7 in the diagram below).
From a product perspective, users get their generated text back as soon as it is ready, without waiting for the longest sequence in the same batch to finish, which is a win for both user experience and GPU utilization.
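To make the gain concrete, here is a small toy simulation of the two scheduling policies (purely an illustration of the idea, not vLLM's actual scheduler); it only counts how many decode steps are needed to serve a queue of requests with a fixed number of batch slots:
from collections import deque

def static_batching_steps(requests: list, batch_size: int) -> int:
    """Fixed batches: each batch holds the GPU until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(requests), batch_size):
        steps += max(requests[i:i + batch_size])
    return steps

def continuous_batching_steps(requests: list, batch_size: int) -> int:
    """A finished sequence frees its slot immediately, so a waiting request
    can join the batch at the very next decode step."""
    queue = deque(requests)
    running = []
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:  # fill any free slots
            running.append(queue.popleft())
        steps += 1                                  # one decode step for the whole batch
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]     # finished sequences leave the batch
    return steps

# Each number is how many tokens a request still needs to generate.
requests = [2, 8, 3, 5, 4, 6, 1, 7]
print(static_batching_steps(requests, batch_size=4))      # 15 decode steps
print(continuous_batching_steps(requests, batch_size=4))  # 13 decode steps
The gap grows as the request lengths within a batch become more uneven, and, as noted above, each request can also be returned as soon as it finishes rather than when its whole batch does.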
vLLM has already implemented such an asynchronous inference engine, making it very convenient to use directly.
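The API server used in the Usage section below is built on this engine (AsyncLLMEngine). For reference, a minimal sketch of driving it directly might look like the following; the class and argument names match the vLLM releases around the time of writing and may differ in newer versions:
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Build the asynchronous engine (the same component the API server wraps).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="facebook/opt-125m")
)

async def generate(prompt: str) -> str:
    # Each request gets its own id so the engine can schedule it
    # alongside whatever else is currently decoding.
    request_id = uuid.uuid4().hex
    params = SamplingParams(temperature=0.0, max_tokens=16)

    final_output = None
    # `generate()` is an async generator that yields a partial result
    # every time a new token is decoded for this request.
    async for output in engine.generate(prompt, params, request_id):
        final_output = output
    return final_output.outputs[0].text

print(asyncio.run(generate("San Francisco is a")))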
Usage
First, ensure that the vllm package is installed:
pip3 install vllm
After that, you can start the vLLM API service with the following command:
python3 -m vllm.entrypoints.api_server --model facebook/opt-125m
For testing purposes, I chose the facebook/opt-125m model; naturally, we shouldn't expect much from a model of this size. Once the server is up, it listens on port 8000 by default.
We can use the following command directly in the terminal:
curl \
-X POST \
-H "User-Agent: Test Client" \
-H "Content-Type: application/json" \
-d '{
"prompt": "San Francisco is a",
"n": 4,
"use_beam_search": true,
"temperature": 0.0,
"max_tokens": 16,
"stream": true
}' \
http://localhost:8000/generate
Because the payload sets "stream": true, the response comes back as a stream of partial results rather than a single JSON object.
If you prefer using Python, you can test the vLLM API with a short script. The sketch below uses the requests package and mirrors the payload of the curl example above; the exact format of the streamed chunks may vary between vLLM versions, so adjust the parsing if needed:
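import json

import requests

# Minimal client sketch for the demo API server started above.
# Adjust the host and port if you changed them when launching the server.
API_URL = "http://localhost:8000/generate"

payload = {
    "prompt": "San Francisco is a",
    "n": 4,
    "use_beam_search": True,
    "temperature": 0.0,
    "max_tokens": 16,
    "stream": True,
}

with requests.post(API_URL, json=payload, stream=True) as response:
    response.raise_for_status()
    # The demo server streams JSON chunks; in the versions around this
    # writing they are separated by a null byte ("\0"). Change the
    # delimiter if your vLLM version formats the stream differently.
    for chunk in response.iter_lines(delimiter=b"\0"):
        if not chunk:
            continue
        data = json.loads(chunk.decode("utf-8"))
        # "text" holds the outputs generated so far for this request.
        print(data["text"])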
Since the vLLM API handles requests asynchronously, you can send multiple requests at the same time and should see each one return its generated results almost immediately, which makes for a very good experience.
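As a rough illustration (the prompts and parameters below are arbitrary), you can fire off several requests at once with the standard library's ThreadPoolExecutor and watch them all come back quickly:
from concurrent.futures import ThreadPoolExecutor

import requests

# Send several non-streaming requests concurrently to the demo API server.
API_URL = "http://localhost:8000/generate"

prompts = [
    "San Francisco is a",
    "The capital of France is",
    "Large language models are",
    "In the morning I like to",
]

def ask(prompt: str) -> list:
    payload = {"prompt": prompt, "temperature": 0.0, "max_tokens": 16}
    response = requests.post(API_URL, json=payload)
    response.raise_for_status()
    return response.json()["text"]  # the demo server returns {"text": [...]}

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for result in pool.map(ask, prompts):
        print(result)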
Next, if I have the chance, I would like to share how to integrate vLLM with a streaming interface like Gradio.
References
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention