Last Updated on 2024-11-20 by Clay
Recently, I attempted to implement various speculative decoding acceleration methods. HuggingFace's `transformers` library also provides a corresponding acceleration feature called `assistant_model`. Today, let me take this opportunity to document it.
Before using these methods, it is recommended to create a Python virtual environment and upgrade the `transformers` library to the latest version.
`assistant_model` Usage
If you want to understand the principles of Speculative Decoding, you can refer to the original paper: Fast Inference from Transformers via Speculative Decoding
Or check out my notes: [Paper Reading] Fast Inference from Transformers via Speculative Decoding
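In short, the draft model cheaply proposes a short block of tokens, and the target model verifies the whole block in a single forward pass, so the final output is identical to what the target model alone would have produced. To make the idea concrete, here is a toy sketch of the greedy variant (my own illustration, not the `transformers` implementation; `draft_next_token` and `target_next_tokens` are hypothetical stand-ins for real models):

```python
def speculative_decode_greedy(draft_next_token, target_next_tokens, prefix, k=4, max_new_tokens=16):
    """Toy greedy speculative decoding loop.

    draft_next_token(ids)   -> the draft's greedy next-token id (int)
    target_next_tokens(ids) -> for every position j, the target's greedy choice
                               for the token that should follow ids[: j + 1]
                               (a list with the same length as ids)
    """
    ids = list(prefix)
    while len(ids) - len(prefix) < max_new_tokens:
        base = len(ids)

        # 1) The draft model cheaply proposes k tokens, one at a time.
        draft_ids = list(ids)
        for _ in range(k):
            draft_ids.append(draft_next_token(draft_ids))

        # 2) The target model scores the whole proposed block in ONE pass.
        target_preds = target_next_tokens(draft_ids)

        # 3) Accept proposed tokens while they match the target's own choice.
        n_accepted = 0
        for i in range(k):
            if draft_ids[base + i] == target_preds[base + i - 1]:
                n_accepted += 1
            else:
                break

        # Keep the accepted tokens, then add one token from the target: either
        # its correction at the first mismatch, or a bonus token if all k matched.
        ids = draft_ids[: base + n_accepted]
        ids.append(target_preds[base + n_accepted - 1])

    return ids[: len(prefix) + max_new_tokens]


if __name__ == "__main__":
    # Toy "models": the target always continues 0, 1, 2, ...; the draft agrees
    # most of the time but makes an occasional mistake, which gets rejected.
    def target_next_tokens(ids):
        return [t + 1 for t in ids]

    def draft_next_token(ids):
        return ids[-1] + 1 if len(ids) % 7 else ids[-1] + 99

    print(speculative_decode_greedy(draft_next_token, target_next_tokens, prefix=[0], max_new_tokens=12))
    # -> [0, 1, 2, ..., 12]: identical to the target's own greedy decoding.
```

Because every kept token is either checked against or produced by the target model, a bad draft never changes the output; it only lowers the number of accepted tokens per step, and therefore the speed-up.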
Using Speculative Decoding in the `transformers` library is quite simple: when decoding with the `.generate()` method, you can pass a draft model through the `assistant_model` parameter to accelerate decoding.
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main() -> None:
    # Settings
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    target_model_path = "./models/HuggingFaceTB--SmolLM2-1.7B-Instruct"
    draft_model_path = "./models/HuggingFaceTB--SmolLM2-135M-Instruct"

    # Load Tokenizer
    draft_tokenizer = AutoTokenizer.from_pretrained(draft_model_path)
    target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)

    # Load Model
    draft_model = AutoModelForCausalLM.from_pretrained(draft_model_path, torch_dtype=torch.bfloat16).to(device)
    target_model = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.bfloat16).to(device)

    # Messages
    messages = [
        [
            {
                "role": "user",
                "content": "What is the capital of Taiwan. And why?",
            },
        ],
    ]

    # Tokenize
    input_text = target_tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = draft_tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    ).to(device)

    # Target Model Generate Directly
    start_time = time.time()
    outputs = target_model.generate(**inputs, max_new_tokens=100)
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print("=== Directly Generate ===")
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.\n")

    # Speculative Decoding
    print("=== Speculative Decoding ===")
    start_time = time.time()
    outputs = target_model.generate(
        **inputs,
        max_new_tokens=100,
        assistant_model=draft_model,
    )
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.")


if __name__ == "__main__":
    main()
```
Output:

```
=== Directly Generate ===
Generated Tokens: 100
Spent Time: 1.9954736232757568 seconds.

=== Speculative Decoding ===
Generated Tokens: 100
Spent Time: 1.9073119163513184 seconds.
```
However, at the scale of this test (a 1.7B target model generating only 100 new tokens), it is hard to observe a noticeable improvement in speed.
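If you want to see a clearer gap, two knobs usually help: generate more new tokens and let the draft model propose more candidate tokens per verification step. Below is a hedged sketch that reuses the variables from the script above; `num_assistant_tokens` is a `GenerationConfig` attribute in recent `transformers` versions, so check that your installed version supports it:

```python
# Hedged benchmarking tweak: longer generations and more draft candidates per
# step make the speed difference easier to measure (assumes a recent transformers).
draft_model.generation_config.num_assistant_tokens = 10  # candidate tokens proposed per step

for label, extra_kwargs in [
    ("Directly Generate", {}),
    ("Speculative Decoding", {"assistant_model": draft_model}),
]:
    start_time = time.time()
    outputs = target_model.generate(**inputs, max_new_tokens=512, do_sample=False, **extra_kwargs)
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"{label}: {new_tokens} tokens in {time.time() - start_time:.2f} seconds")
```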
Support for Different Tokenizers in Speculative Decoding
Previously, a draft model could only accelerate a target model if the two shared the same vocabulary, meaning they had to use the same tokenizer. This is because the original sampling method verifies probabilities at the same token positions.
Now, HuggingFace supports using draft models with different vocabularies to accelerate decoding! For details, check out Universal Assisted Generation: Faster Decoding with Any Assistant Model.
The concept is straightforward: we decode the draft model's output back into text, then tokenize that text with the target tokenizer so the target model can verify the resulting candidate tokens. However, this approach limits us to greedy matching during verification; we cannot perform the Speculative Sampling proposed in the paper.
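To see roughly what this looks like, here is a minimal sketch of the re-tokenization trick (my own simplified illustration, not the internal `transformers` code; it assumes `draft_model`, `draft_tokenizer`, `target_model`, `target_tokenizer`, `device`, and a plain string `prompt` are already set up, and it glosses over details such as token boundaries at the prompt/continuation seam):

```python
import torch

with torch.no_grad():
    # 1) The draft model writes a few candidate tokens in ITS OWN vocabulary.
    draft_inputs = draft_tokenizer(prompt, return_tensors="pt").to(device)
    draft_out = draft_model.generate(**draft_inputs, max_new_tokens=5, do_sample=False)
    draft_text = draft_tokenizer.decode(draft_out[0], skip_special_tokens=True)

    # 2) The candidate text is re-encoded with the TARGET tokenizer, so the
    #    target model can score it even though the vocabularies differ.
    candidate_ids = target_tokenizer(draft_text, return_tensors="pt").input_ids.to(device)

    # 3) One forward pass of the target model over the candidates; keep the
    #    longest prefix that matches the target's own greedy choices.
    greedy = target_model(candidate_ids).logits.argmax(dim=-1)  # (1, seq_len)
    n_prompt = target_tokenizer(prompt, return_tensors="pt").input_ids.shape[-1]
    accepted = 0
    for pos in range(n_prompt, candidate_ids.shape[-1]):
        if candidate_ids[0, pos] == greedy[0, pos - 1]:
            accepted += 1
        else:
            break
    print(f"Accepted {accepted} of {candidate_ids.shape[-1] - n_prompt} draft tokens")
```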
Nonetheless, this is a highly practical technique, as many large models in need of acceleration may not have smaller versions sharing the same vocabulary.
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main() -> None:
    # Settings
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    target_model_path = "./models/HuggingFaceTB--SmolLM2-1.7B-Instruct"
    draft_model_path = "./models/openai-community--gpt2"

    # Load Tokenizer
    draft_tokenizer = AutoTokenizer.from_pretrained(draft_model_path)
    target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)
    print(f"draft tokenizer vocab size: {len(draft_tokenizer)}")
    print(f"target tokenizer vocab size: {len(target_tokenizer)}\n")

    # Load Model
    draft_model = AutoModelForCausalLM.from_pretrained(draft_model_path, torch_dtype=torch.bfloat16).to(device)
    target_model = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.bfloat16).to(device)

    # Messages
    messages = [
        [
            {
                "role": "user",
                "content": "What is the capital of Taiwan. And why?",
            },
        ],
    ]

    # Tokenize (the prompt passed to `.generate()` must be in the *target*
    # model's vocabulary, so here we encode it with the target tokenizer)
    input_text = target_tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = target_tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    ).to(device)

    # Target Model Generate Directly
    start_time = time.time()
    outputs = target_model.generate(**inputs, max_new_tokens=100)
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print("=== Directly Generate ===")
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.\n")

    # Speculative Decoding
    print("=== Speculative Decoding ===")
    start_time = time.time()
    outputs = target_model.generate(
        **inputs,
        max_new_tokens=100,
        assistant_model=draft_model,
        tokenizer=target_tokenizer,
        assistant_tokenizer=draft_tokenizer,
    )
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.")


if __name__ == "__main__":
    main()
```
Output:

```
draft tokenizer vocab size: 50257
target tokenizer vocab size: 49152

=== Directly Generate ===
Generated Tokens: 100
Spent Time: 2.0288448333740234 seconds.

=== Speculative Decoding ===
Generated Tokens: 100
Spent Time: 2.306903839111328 seconds.
```
We can see that it is indeed possible to use draft and target models with different vocabularies for Speculative Decoding, as long as their respective tokenizers are specified: during generation and verification, both tokenizers are used for encoding and decoding. In this small test the extra encode/decode work actually made it slower than direct generation, so whether it pays off depends on the models and workload.
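One small note: the ids returned by `.generate()` are always in the target model's vocabulary, so if you want to read the generated answer rather than just count tokens, decode it with the target tokenizer:

```python
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```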
References
- Fast Inference from Transformers via Speculative Decoding
- Universal Assisted Generation: Faster Decoding with Any Assistant Model