Using vLLM To Accelerate The Decoding Of Large Language Models

Last Updated on 2023-12-14 by Clay

Introduction

vLLM is a large language model (LLM) acceleration framework developed by a research team at the University of California, Berkeley. It uses PagedAttention to improve GPU VRAM utilization, and this method does not change the model architecture.

PagedAttention is inspired by the classic virtual memory and paging techniques in operating systems, hence the name "Paged". Its key idea is that the logically contiguous keys and values of the Transformer attention mechanism can be stored in non-contiguous memory.

PagedAttention splits the KV cache of every sequence into blocks, and each block holds a fixed number of tokens.
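To make the idea concrete, here is a minimal conceptual sketch (not vLLM's actual implementation) of a block table that maps a sequence's logical KV-cache blocks onto non-contiguous physical blocks, each holding a fixed number of tokens:

# Conceptual sketch only: a per-sequence block table in the spirit of PagedAttention.
BLOCK_SIZE = 16  # tokens per block (illustrative value)

class BlockTable:
    def __init__(self):
        self.physical_blocks = []  # physical block indices, in logical order
        self.num_tokens = 0

    def append_token(self, free_blocks):
        # Allocate a new physical block only when the current block is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.physical_blocks.append(free_blocks.pop())
        self.num_tokens += 1

# Pool of free physical block indices; blocks handed to a sequence need not be contiguous.
free_blocks = list(range(100))

seq = BlockTable()
for _ in range(40):  # simulate decoding 40 tokens
    seq.append_token(free_blocks)

print(seq.physical_blocks)  # e.g. [99, 98, 97] -> 3 non-contiguous blocks for 40 tokens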

This method achieves faster speed compared with other approaches.

As the benchmarks published by the vLLM team show, vLLM delivers noticeably faster inference speed under a variety of conditions.


Usage

First, we need to install the vllm package; if you want to use an AWQ-quantized model, you need to install autoawq as well.

pip3 install vllm autoawq


By the way, I originally wanted to test OpenHermes, a fine-tuned Mistral model, but I found that I did not have enough VRAM on my local device, so I had to use an AWQ-quantized model. However, no matter how I loaded the model, I always got an error message.

After checking, I found the reason: vLLM does not yet support AWQ for many model architectures.

It only supports a limited set of architectures at the moment.

So I just tested Llama instead.

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


Output:

Prompt: 'Hello, my name is', Generated text: " Sherry and I'm a 35-year-old woman from"
Prompt: 'The president of the United States is', Generated text: ' a member of the executive branch of the federal government. The president serves as the'
Prompt: 'The capital of France is', Generated text: ' Paris, which is known for its stunning architecture, art museums, historical'
Prompt: 'The future of AI is', Generated text: ' exciting and uncertain. Here are some potential developments that could shape the field'


This inference takes only about 10 seconds including the model loading time! Excluding model loading, generating these few tokens takes only about 0.3 seconds!
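For reference, here is a rough way to measure those two numbers separately (my own measurement sketch, not code from the vLLM project):

import time
from vllm import LLM, SamplingParams

# Time the model loading step.
start = time.perf_counter()
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
load_time = time.perf_counter() - start

# Time the generation step separately.
start = time.perf_counter()
outputs = llm.generate(["The capital of France is"], SamplingParams(temperature=0.8, top_p=0.95))
gen_time = time.perf_counter() - start

print(f"Load: {load_time:.1f}s, Generate: {gen_time:.2f}s")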

Wow, it is really fast! I hope I can find enough time to study the details of their implementation...

By the way, the following is the official AWQ quantization script:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
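
Once the quantized model is saved, it should be loadable in vLLM just like the pre-quantized checkpoint above. A small sketch, assuming the local directory is the quant_path from the script:

from vllm import LLM, SamplingParams

# Load the locally quantized model from the directory saved above.
llm = LLM(model="vicuna-7b-v1.5-awq", quantization="AWQ")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)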


However, the official documentation also states that although vLLM currently supports AWQ, the support has not yet been optimized, so inference will be slower than without quantization. In other words, for the AWQ model I tested above, the unquantized version would actually be faster.
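For comparison, the unquantized checkpoint is loaded the same way, just without the quantization argument (a sketch assuming the Hugging Face ID meta-llama/Llama-2-7b-chat-hf, enough VRAM, and access to the gated repository):

from vllm import LLM, SamplingParams

# Unquantized Llama-2 chat model; requires more VRAM than the AWQ version.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
outputs = llm.generate(["The capital of France is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)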

