AI

Using The Target Model's Confidence Threshold To Decide Whether To Enable Speculative Decoding

Many of the inference acceleration techniques I have studied, such as Speculative Decoding, rely on a threshold on the draft model's confidence scores. This threshold determines how many draft tokens are decoded before they are passed to the target model for verification, which reduces the wasted computation when the draft model has low confidence.
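
To make the draft-side threshold concrete, here is a minimal sketch of a drafting loop that stops once confidence drops. The `next_token` callback, the `threshold` default of 0.6, the `max_draft` cap, and the toy model are all assumptions made for illustration, not taken from any particular library. Keeping the low-confidence token in the draft is also a design choice; other implementations may discard it.

```python
from typing import Callable, List, Tuple

def draft_tokens(
    next_token: Callable[[List[int]], Tuple[int, float]],
    prefix: List[int],
    threshold: float = 0.6,   # assumed default, tune per model
    max_draft: int = 8,       # assumed upper bound on draft length
) -> List[int]:
    """Draft tokens until the draft model's confidence falls below `threshold`."""
    drafted: List[int] = []
    for _ in range(max_draft):
        token, confidence = next_token(prefix + drafted)
        drafted.append(token)
        if confidence < threshold:
            # Stop early: further drafts would likely be rejected on verification.
            break
    return drafted

# Hypothetical stand-in for a draft model: fixed confidences per step.
TOY_CONFIDENCES = [0.9, 0.8, 0.4, 0.95]

def toy_next_token(context: List[int]) -> Tuple[int, float]:
    step = len(context) - 1  # the example prefix below holds a single token
    return 100 + step, TOY_CONFIDENCES[step]

drafted = draft_tokens(toy_next_token, prefix=[1], threshold=0.6)
print(drafted)  # [100, 101, 102]
```

The third token has confidence 0.4, so drafting stops there and the three drafted tokens would be handed to the target model for verification.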

Using the `assistant_model` parameter in HuggingFace's `transformers` library to accelerate generation with Speculative Decoding

Recently, I have been implementing various speculative decoding acceleration methods. HuggingFace's `transformers` library provides a corresponding acceleration feature through the `assistant_model` argument of `generate()`. Today, let me take this opportunity to document it.
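
As a rough sketch of what this looks like in practice, the helper below passes a smaller model to `generate()` through `assistant_model`. The function name `generate_with_assistant` and its parameters are hypothetical, the checkpoint names are left to the caller, and the sketch assumes the target and assistant models share the same tokenizer.

```python
def generate_with_assistant(target_name, assistant_name, prompt, max_new_tokens=64):
    """Generate text from the target model, using a smaller assistant model as the drafter."""
    # Imported lazily so this sketch can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: both checkpoints use the same tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(target_name)
    target = AutoModelForCausalLM.from_pretrained(target_name)
    assistant = AutoModelForCausalLM.from_pretrained(assistant_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = target.generate(
        **inputs,
        assistant_model=assistant,   # enables assisted (speculative) decoding
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling it requires two compatible checkpoints, a large target and a small assistant from the same model family.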

Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

Over the past week, I dedicated some time to reproducing the Self-Speculative Decoding mechanism based on the ideas from the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, implementing the following modules:

  • A Decoder-only Transformer model with layer skipping (based on Llama and Gemma-2 architectures)
  • Adaptive Draft Exit Mechanism
  • Bayesian Optimization to discover the best layer-skipping strategy (optimizing draft model configurations)
  • Self-Speculative Decoding — achieving acceleration purely through the model itself

[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Highlights of This Paper

  • Quantization, pruning, and distillation can also accelerate models, but come with issues like changes in output distribution compared to the original model, as well as the cost of retraining.
  • The original Speculative Decoding faces the issue of requiring additional memory to run the draft model, whereas Self-Speculative Decoding uses part of its own neural network as the draft model.
  • The Adaptive Draft-Exiting Mechanism can automatically adjust the number of tokens predicted by the draft model based on confidence score thresholds.
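
One way to picture the adaptive part is a controller that nudges the confidence threshold according to how many drafted tokens the target model accepted. The sketch below loosely follows a momentum-style update as I understand the idea; the class name, field names, and every default value are assumptions for illustration and are not the paper's reported settings.

```python
from dataclasses import dataclass

@dataclass
class AdaptiveThreshold:
    threshold: float = 0.6           # current draft-exit confidence threshold (assumed start)
    target_rate: float = 0.9         # desired acceptance rate (assumed)
    step: float = 0.01               # size of each threshold adjustment (assumed)
    rate_momentum: float = 0.5       # smoothing for the acceptance-rate estimate (assumed)
    threshold_momentum: float = 0.9  # smoothing for the threshold itself (assumed)
    acceptance_rate: float = 0.9     # running acceptance-rate estimate

    def update(self, accepted: int, drafted: int) -> float:
        """Update the threshold after one verification round and return it."""
        if drafted == 0:
            return self.threshold
        observed = accepted / drafted
        self.acceptance_rate = (
            self.rate_momentum * self.acceptance_rate
            + (1 - self.rate_momentum) * observed
        )
        # Too many rejections: raise the threshold so drafting stops earlier.
        # Enough acceptances: lower it so the draft model runs longer.
        if self.acceptance_rate <= self.target_rate:
            proposal = self.threshold + self.step
        else:
            proposal = self.threshold - self.step
        self.threshold = (
            self.threshold_momentum * self.threshold
            + (1 - self.threshold_momentum) * proposal
        )
        self.threshold = min(1.0, max(0.0, self.threshold))
        return self.threshold

controller = AdaptiveThreshold()
controller.update(accepted=1, drafted=5)  # a poor round pushes the threshold up
```

After a round in which only 1 of 5 drafted tokens is accepted, the threshold moves slightly above its starting value, so the next draft tends to exit sooner.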

Optimizing LayerSkip Models with Bayesian Search for an Effective Layer Skipping Strategy

In self-speculative decoding, the draft model is derived from part of the target model's own network, so finding a good 'Layer Skip Strategy' is crucial. We need to skip enough layers to achieve a meaningful speedup while keeping the draft model's predictions good enough to avoid frequent rejection by the target model.

Today’s implementation focuses on optimizing my previously implemented LayerSkip model using the Bayesian optimization framework Optuna, to determine which layers to skip.
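
A minimal sketch of how such a search could be set up with Optuna follows. Each layer gets a binary "skip or keep" parameter, and the objective returns a score to maximize. The helper names, the `skip_layer_{i}` parameter naming, and `toy_measure_speed` are hypothetical; in a real run the scoring function would time self-speculative decoding with the chosen layers skipped.

```python
from typing import Callable, List

def make_objective(num_layers: int, measure_speed: Callable[[List[int]], float]):
    """Build an Optuna objective that chooses which layers the draft pass skips."""
    def objective(trial) -> float:
        skipped = [
            i for i in range(num_layers)
            if trial.suggest_int(f"skip_layer_{i}", 0, 1) == 1
        ]
        return measure_speed(skipped)
    return objective

def toy_measure_speed(skipped: List[int]) -> float:
    # Stand-in score: skipping layers speeds up drafting, but skipping
    # too many is penalized to mimic more frequent rejections.
    return len(skipped) - 0.2 * len(skipped) ** 2

def search(num_layers: int = 8, n_trials: int = 30) -> dict:
    # Imported here so the helpers above work without Optuna installed.
    import optuna

    study = optuna.create_study(direction="maximize")  # default sampler is TPE
    study.optimize(make_objective(num_layers, toy_measure_speed), n_trials=n_trials)
    return study.best_params
```

Calling `search()` returns a dictionary mapping each `skip_layer_{i}` name to 0 or 1 for the best trial found.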

Self-Speculative Decoding Implementation: LayerSkip Transformer

Introduction

Self-Speculative Decoding is a variant of Speculative Decoding. The original Speculative Decoding method uses a draft model to accelerate the inference of the target model. The draft model, which is typically distilled from the target model, offers similar output quality while running several times faster.

KV Cache: A Caching Mechanism To Accelerate Transformer Generation

When large language models decode, especially auto-regressive models, generation must proceed token by token until the entire sequence is produced. Within this process, caching techniques can reduce computation and improve decoding speed; one such technique is the KV Cache.
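
The idea can be illustrated with a toy single-head attention in plain Python. This is a simplified sketch: the query, key, and value projections are assumed to be the identity, and the `KVCache` class is a hypothetical illustration rather than any library's API. The point is that each step appends one key and one value instead of recomputing them for the whole prefix.

```python
import math
from typing import List

Vector = List[float]

def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all given keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

class KVCache:
    """Stores keys and values of past tokens so each step only adds the new one."""
    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, query: Vector, key: Vector, value: Vector) -> Vector:
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token representations

cache = KVCache()
cached_outputs = [cache.step(x, x, x) for x in tokens]

# Without a cache, every step rebuilds keys and values for the full prefix.
full_outputs = [attend(tokens[t], tokens[: t + 1], tokens[: t + 1]) for t in range(len(tokens))]
```

Both lists hold the same attention outputs; the cached version simply avoids redoing the key and value work for earlier tokens at every step.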

Using Finite State Machine (FSM) and Rollback Mechanism to Restrict LLM from Generating Banned Words

When implementing various services through LLMs, do you worry about uncontrolled language generation? Recently, at a critical juncture in wrapping up a project, I used tools like Outlines to constrain LLM decoding, which effectively controlled the model's output to follow the desired patterns. However, a colleague posed a deep question: What if I want it not to generate specific words?

Note on Calculating VRAM Consumption for Training and Inference of AI Models

I've always used rough formulas to estimate the relationship between the scale of my models and the GPU VRAM consumption; after all, there are too many variables involved—model architecture, number of layers, attention mechanism implementation, sequence length, batch size, data precision used in training or inference... all of these affect our final calculation results.
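
As an example of the kind of rough formula involved, the sketch below estimates two inference-time components: the model weights and the KV cache. It deliberately ignores activations, framework overhead, and fragmentation, and the model configuration used in the example (32 layers, 32 KV heads, head dimension 128) is a hypothetical one chosen for illustration.

```python
GIB = 1024 ** 3

def weights_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory for model weights; 2 bytes per parameter corresponds to FP16/BF16."""
    return num_params * bytes_per_param / GIB

def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,
) -> float:
    """Memory for the KV cache: one key and one value vector per layer, head, and token."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / GIB

# Hypothetical 7B-parameter model served in 16-bit precision.
print(round(weights_gib(7_000_000_000), 2))   # 13.04
print(kv_cache_gib(32, 32, 128, 4096, 1))     # 2.0
```

Even this simplified estimate shows why sequence length and batch size matter: the KV cache term grows linearly with both, while the weight term stays fixed.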
