Last Updated on 2024-11-20 by Clay
Recently, I have been implementing various speculative decoding acceleration methods. HuggingFace's transformers library also provides a corresponding acceleration feature through the assistant_model parameter, so I am taking this opportunity to document how to use it.
Before using these methods, I recommend creating a Python virtual environment and upgrading the transformers library to the latest version.
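If you are unsure which version you have installed, a quick check looks like this. Note that the different-tokenizer feature shown later in this post only exists in recent releases, so upgrading is the first thing to try if that example fails:

import transformers

# Print the installed transformers version. The cross-tokenizer assisted
# generation shown later in this post requires a recent release, so upgrade
# first if the second example does not run.
print(transformers.__version__)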
`assistant_model` Usage
If you want to understand the principles of Speculative Decoding, you can refer to the original paper: Fast Inference from Transformers via Speculative Decoding
Or check out my notes: [Paper Reading] Fast Inference from Transformers via Speculative Decoding
Using Speculative Decoding in the transformers library is quite simple: when calling the .generate() method for decoding, pass a draft model through the assistant_model parameter and decoding will be accelerated.
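Stripped down to the essentials, the call looks like this (a minimal sketch; the model names are placeholders rather than the models used below):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model names: any causal LM pair whose tokenizers match will do.
tokenizer = AutoTokenizer.from_pretrained("target-model-name")
target_model = AutoModelForCausalLM.from_pretrained("target-model-name")
draft_model = AutoModelForCausalLM.from_pretrained("draft-model-name")

inputs = tokenizer("What is the capital of Taiwan?", return_tensors="pt")
outputs = target_model.generate(
    **inputs,
    max_new_tokens=100,
    assistant_model=draft_model,  # the draft model drives speculative decoding
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The complete script below uses SmolLM2-1.7B-Instruct as the target model and SmolLM2-135M-Instruct as the draft model, and compares direct generation against assisted generation: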
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main() -> None:
    # Settings
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    target_model_path = "./models/HuggingFaceTB--SmolLM2-1.7B-Instruct"
    draft_model_path = "./models/HuggingFaceTB--SmolLM2-135M-Instruct"

    # Load Tokenizer
    draft_tokenizer = AutoTokenizer.from_pretrained(draft_model_path)
    target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)

    # Load Model
    draft_model = AutoModelForCausalLM.from_pretrained(draft_model_path, torch_dtype=torch.bfloat16).to(device)
    target_model = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.bfloat16).to(device)

    # Messages
    messages = [
        [
            {
                "role": "user",
                "content": "What is the capital of Taiwan. And why?",
            },
        ],
    ]

    # Tokenize
    input_text = target_tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = draft_tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    ).to(device)

    # Target Model Generate Directly
    start_time = time.time()
    outputs = target_model.generate(**inputs, max_new_tokens=100)
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print("=== Directly Generate ===")
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.\n")

    # Speculative Decoding
    print("=== Speculative Decoding ===")
    start_time = time.time()
    outputs = target_model.generate(
        **inputs,
        max_new_tokens=100,
        assistant_model=draft_model,
    )
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.")


if __name__ == "__main__":
    main()
Output:
=== Directly Generate ===
Generated Tokens: 100
Spent Time: 1.9954736232757568 seconds.
=== Speculative Decoding ===
Generated Tokens: 100
Spent Time: 1.9073119163513184 seconds.
However, at the small scale of this test, it is hard to observe a noticeable speed-up.
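For a fairer comparison, it helps to add a warm-up run, synchronize the GPU before reading the clock, and average over several generations. A rough sketch (the benchmark helper below is my own, not part of transformers):

import time
import torch

def benchmark(generate_fn, n_runs: int = 5) -> float:
    """Average the wall-clock time of a zero-argument generation callable."""
    generate_fn()  # warm-up: CUDA kernels, allocator, cache setup
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        generate_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.time() - start) / n_runs

# Usage, with the models and inputs from the script above:
# direct_time = benchmark(lambda: target_model.generate(**inputs, max_new_tokens=100))
# assisted_time = benchmark(lambda: target_model.generate(**inputs, max_new_tokens=100,
#                                                         assistant_model=draft_model))

Speculative decoding also tends to pay off more with larger target models and longer generations, where each saved target-model forward pass matters more.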
Support for Different Tokenizers in Speculative Decoding
Previously, the draft model could only accelerate the target model if the two shared the same vocabulary, meaning they had to use the same tokenizer. This is because the original sampling method verifies draft and target probabilities at the same token positions.
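Concretely, the paper's verification compares, for every drafted token, the target model's probability of that exact token id against the draft model's probability of it, which only makes sense when both models index the same vocabulary. A rough sketch of the acceptance rule (my own illustration, not the transformers implementation):

import torch

def verify_draft_tokens(draft_tokens, p_target, q_draft):
    """Speculative Sampling acceptance test (simplified sketch).

    draft_tokens: (k,) token ids proposed by the draft model
    p_target:     (k, vocab_size) target-model probabilities at those positions
    q_draft:      (k, vocab_size) draft-model probabilities at the same positions
    Both distributions must index the SAME vocabulary, which is why the two
    models originally had to share a tokenizer.
    """
    accepted = []
    for i, token in enumerate(draft_tokens):
        p, q = p_target[i, token], q_draft[i, token]
        # Accept the drafted token with probability min(1, p / q).
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(token)
        else:
            # On the first rejection, resample from the residual distribution
            # max(0, p_target - q_draft), renormalized, and stop.
            residual = torch.clamp(p_target[i] - q_draft[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).squeeze())
            break
    return accepted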
Now, HuggingFace supports using draft models with different vocabularies to accelerate decoding! For details, check out Universal Assisted Generation: Faster Decoding with Any Assistant Model.
The concept is straightforward: decode the draft model's output back into text, then re-tokenize that text with the target tokenizer so the target model can verify the candidate tokens. However, this approach limits verification to greedy matching of tokens; we cannot perform the Speculative Sampling proposed in the paper.
Nonetheless, this is a highly practical technique, as many large models in need of acceleration may not have smaller versions sharing the same vocabulary.
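To make the mechanism concrete, here is a self-contained sketch of what one draft-and-verify step conceptually does when the vocabularies differ. This is only my illustration of the idea, not how transformers actually implements it:

import torch


@torch.no_grad()
def universal_assisted_step(prompt_ids, target_model, target_tokenizer,
                            draft_model, draft_tokenizer, n_draft: int = 5):
    """Conceptual sketch of one draft-and-verify step across two vocabularies.

    NOTE: an illustration of the idea only, not the transformers internals.
    `prompt_ids` are token ids in the TARGET model's vocabulary.
    """
    # 1. Let the draft model propose a short continuation in its own vocabulary.
    prompt_text = target_tokenizer.decode(prompt_ids[0], skip_special_tokens=True)
    draft_inputs = draft_tokenizer(prompt_text, return_tensors="pt").to(draft_model.device)
    draft_output = draft_model.generate(**draft_inputs, max_new_tokens=n_draft, do_sample=False)
    draft_text = draft_tokenizer.decode(
        draft_output[0, draft_inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )

    # 2. Hop through text: re-tokenize the proposed continuation with the TARGET tokenizer.
    candidate_ids = target_tokenizer(
        draft_text, add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(target_model.device)

    # 3. Greedy verification: run the target model once over prompt + candidates and
    #    keep only the prefix of candidate tokens matching its own greedy choices.
    full_ids = torch.cat([prompt_ids, candidate_ids], dim=-1)
    greedy = target_model(full_ids).logits.argmax(dim=-1)  # greedy[:, j] predicts full_ids[:, j + 1]
    n_prompt = prompt_ids.shape[-1]
    accepted = 0
    for i in range(candidate_ids.shape[-1]):
        if greedy[0, n_prompt + i - 1].item() != candidate_ids[0, i].item():
            break
        accepted += 1
    return torch.cat([prompt_ids, candidate_ids[:, :accepted]], dim=-1)

With the built-in support none of this bookkeeping is needed; the script below simply passes both tokenizers to .generate(). Here GPT-2 serves as the draft model for the SmolLM2-1.7B-Instruct target model, even though their vocabularies differ: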
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main() -> None:
    # Settings
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    target_model_path = "./models/HuggingFaceTB--SmolLM2-1.7B-Instruct"
    draft_model_path = "./models/openai-community--gpt2"

    # Load Tokenizer
    draft_tokenizer = AutoTokenizer.from_pretrained(draft_model_path)
    target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)

    print(f"draft tokenizer vocab size: {len(draft_tokenizer)}")
    print(f"target tokenizer vocab size: {len(target_tokenizer)}\n")

    # Load Model
    draft_model = AutoModelForCausalLM.from_pretrained(draft_model_path, torch_dtype=torch.bfloat16).to(device)
    target_model = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.bfloat16).to(device)

    # Messages
    messages = [
        [
            {
                "role": "user",
                "content": "What is the capital of Taiwan. And why?",
            },
        ],
    ]

    # Tokenize (the inputs go to the target model, so use the target tokenizer)
    input_text = target_tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = target_tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    ).to(device)

    # Target Model Generate Directly
    start_time = time.time()
    outputs = target_model.generate(**inputs, max_new_tokens=100)
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print("=== Directly Generate ===")
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.\n")

    # Speculative Decoding (different vocabularies: pass both tokenizers)
    print("=== Speculative Decoding ===")
    start_time = time.time()
    outputs = target_model.generate(
        **inputs,
        max_new_tokens=100,
        assistant_model=draft_model,
        tokenizer=target_tokenizer,
        assistant_tokenizer=draft_tokenizer,
    )
    generated_token_num = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"Generated Tokens: {generated_token_num}")
    print(f"Spent Time: {time.time() - start_time} seconds.")


if __name__ == "__main__":
    main()
Output:
draft tokenizer vocab size: 50257
target tokenizer vocab size: 49152
=== Directly Generate ===
Generated Tokens: 100
Spent Time: 2.0288448333740234 seconds.
=== Speculative Decoding ===
Generated Tokens: 100
Spent Time: 2.306903839111328 seconds.
We can see that it is indeed possible to run Speculative Decoding with draft and target models that use different vocabularies, but we must specify their respective tokenizers, since both are used for encoding and decoding during generation and verification. In this small test the assisted run was in fact slightly slower than direct generation, likely because of the extra encoding and decoding overhead.
References
- Fast Inference from Transformers via Speculative Decoding
- Universal Assisted Generation: Faster Decoding with Any Assistant Model