Last Updated on 2024-11-18 by Clay
I have recently been experimenting with implementing various Speculative Decoding acceleration methods, and HuggingFace's transformers library naturally offers a corresponding feature, assistant_model, so I am taking this opportunity to write it down.
Note that before using these methods, I recommend creating a Python virtual environment and upgrading transformers to the latest version.
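In particular, the cross-tokenizer assisted generation shown later in this post only landed in fairly recent releases of transformers, so it is worth checking what you have installed first. A minimal sanity check (upgrade with pip if the version is old):

import transformers

# Print the installed transformers version; the assistant_tokenizer path used
# later in this post requires a fairly recent release.
print(transformers.__version__)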
How to Use `assistant_model`
If you want to understand how Speculative Decoding works, you can read the original paper, Fast Inference from Transformers via Speculative Decoding, or my notes: [Paper Reading] Fast Inference from Transformers via Speculative Decoding.
Using Speculative Decoding in transformers is also very simple: when calling the model's .generate() method, we pass a draft model through the assistant_model parameter to speed up decoding.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main() -> None:
    # Settings
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    target_model_path = "./models/HuggingFaceTB--SmolLM2-1.7B-Instruct"
    draft_model_path = "./models/HuggingFaceTB--SmolLM2-135M-Instruct"

    # Load Tokenizer
    draft_tokenizer = AutoTokenizer.from_pretrained(draft_model_path)
    target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)

    # Load Model
    draft_model = AutoModelForCausalLM.from_pretrained(draft_model_path, torch_dtype=torch.bfloat16).to(device)
    target_model = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.bfloat16).to(device)

    # Messages
    messages = [
        [
            {
                "role": "user",
                "content": "What is the capital of Taiwan. And why?",
            },
        ],
    ]

    # Tokenize (both models share the same tokenizer here, so either one works)
    input_text = target_tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = target_tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    ).to(device)

    # Target Model Generate Directly
    start_time = time.time()
    outputs = target_model.generate(**inputs, max_new_tokens=100)
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print("=== Directly Generate ===")
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.\n")

    # Speculative Decoding with the draft model as assistant
    print("=== Speculative Decoding ===")
    start_time = time.time()
    outputs = target_model.generate(
        **inputs,
        max_new_tokens=100,
        assistant_model=draft_model,
    )
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.")


if __name__ == "__main__":
    main()
Output:
=== Directly Generate ===
Generated Tokens: 100
Spent Time: 1.9954736232757568 seconds.
=== Speculative Decoding ===
Generated Tokens: 100
Spent Time: 1.9073119163513184 seconds.
That said, since the scale of my test was quite small, the comparison does not really show the speedup.
Speculative Decoding Now Works Even Without a Shared Tokenizer
Previously, we would say that for a draft model to accelerate a target model, the two had to share the same vocabulary, in other words use the same tokenizer. This is because the original sampling-based verification compares the two models' probability distributions at the same token positions.
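To make that requirement concrete, here is a minimal sketch of the acceptance rule from the paper (illustrative only, not the actual transformers implementation). Both probability tensors are indexed by the same token IDs, which only makes sense when the draft and target models share a vocabulary:

import torch

def verify_draft_tokens(draft_token_ids, draft_probs, target_probs):
    # draft_token_ids: (k,) token IDs proposed by the draft model
    # draft_probs:     (k, vocab_size) draft distributions at those positions
    # target_probs:    (k, vocab_size) target distributions at the same positions
    # Both tensors index the SAME vocabulary, which is why the classic scheme
    # requires a shared tokenizer.
    accepted = []
    for i, token_id in enumerate(draft_token_ids):
        p = target_probs[i, token_id]
        q = draft_probs[i, token_id]
        # Accept the draft token with probability min(1, p / q)
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(int(token_id))
        else:
            # On rejection, resample from the normalized residual max(p - q, 0)
            # and stop checking the remaining draft tokens.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            residual = residual / residual.sum()
            accepted.append(int(torch.multinomial(residual, num_samples=1)))
            break
    # (If every draft token is accepted, the target model can additionally
    # sample one extra token "for free" from its own next-token distribution.)
    return accepted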
However, HuggingFace now also supports draft models with a different vocabulary for accelerated decoding! For details, see the article Universal Assisted Generation: Faster Decoding with Any Assistant Model.
The idea is very simple: decode the draft model's output back into a string, re-encode that string into tokens with the target tokenizer, and let the target model verify those tokens. Of course, this currently limits verification to a greedy, token-by-token match; it cannot use the Speculative Sampling verification proposed in the paper.
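A rough sketch of that re-tokenization step could look like the following (illustrative only; the real logic lives inside transformers' assisted-generation code, and the helper name below is hypothetical):

def retokenize_draft(new_draft_ids, draft_tokenizer, target_tokenizer):
    # Decode the draft model's newly generated tokens back into plain text...
    draft_text = draft_tokenizer.decode(new_draft_ids, skip_special_tokens=True)
    # ...then re-encode that text with the target tokenizer, so the target
    # model can check the candidates against its own greedy predictions.
    candidate_ids = target_tokenizer(draft_text, add_special_tokens=False).input_ids
    return candidate_ids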
Still, this is likely a very useful technique, since many large models that need acceleration simply do not have a smaller variant that shares their vocabulary.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main() -> None:
    # Settings
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    target_model_path = "./models/HuggingFaceTB--SmolLM2-1.7B-Instruct"
    draft_model_path = "./models/openai-community--gpt2"

    # Load Tokenizer
    draft_tokenizer = AutoTokenizer.from_pretrained(draft_model_path)
    target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)
    print(f"draft tokenizer vocab size: {len(draft_tokenizer)}")
    print(f"target tokenizer vocab size: {len(target_tokenizer)}\n")

    # Load Model
    draft_model = AutoModelForCausalLM.from_pretrained(draft_model_path, torch_dtype=torch.bfloat16).to(device)
    target_model = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.bfloat16).to(device)

    # Messages
    messages = [
        [
            {
                "role": "user",
                "content": "What is the capital of Taiwan. And why?",
            },
        ],
    ]

    # Tokenize (the prompt must be encoded with the target tokenizer, since
    # these IDs are fed directly to the target model)
    input_text = target_tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = target_tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    ).to(device)

    # Target Model Generate Directly
    start_time = time.time()
    outputs = target_model.generate(**inputs, max_new_tokens=100)
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print("=== Directly Generate ===")
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.\n")

    # Speculative Decoding across different vocabularies: both tokenizers must
    # be passed so that draft tokens can be re-encoded for the target model
    print("=== Speculative Decoding ===")
    start_time = time.time()
    outputs = target_model.generate(
        **inputs,
        max_new_tokens=100,
        assistant_model=draft_model,
        tokenizer=target_tokenizer,
        assistant_tokenizer=draft_tokenizer,
    )
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.")


if __name__ == "__main__":
    main()
Output:
draft tokenizer vocab size: 50257
target tokenizer vocab size: 49152
=== Directly Generate ===
Generated Tokens: 100
Spent Time: 2.0288448333740234 seconds.
=== Speculative Decoding ===
Generated Tokens: 100
Spent Time: 2.306903839111328 seconds.
As you can see, we can indeed pair a draft model and a target model with different vocabularies for Speculative Decoding, but we then have to pass in both tokenizers, because the verification stage now uses them to decode and re-encode the candidate tokens.
References
- Fast Inference from Transformers via Speculative Decoding
- Universal Assisted Generation: Faster Decoding with Any Assistant Model