Skip to content

PyTorch

Using The Target Model's Confidence Threshold To Decide Whether To Enable Speculative Decoding

Many of the inference acceleration techniques I have studied, such as Speculative Decoding, predominantly use a threshold for the confidence scores of the draft model. This threshold determines how many draft tokens should be decoded before passing them to the target model for verification, thereby reducing the extra computational cost when the draft model operates with low confidence.

Read More »Using The Target Model's Confidence Threshold To Decide Whether To Enable Speculative Decoding

Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

Over the past week, I dedicated some time to reproducing the Self-Speculative Decoding mechanism based on the ideas from the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, implementing the following modules:

  • A Decoder-only Transformer model with layer skipping (based on Llama and Gemma-2 architectures)
  • Adaptive Draft Exit Mechanism
  • Bayesian Optimization to discover the best layer-skipping strategy (optimizing draft model configurations)
  • Self-Speculative Decoding — achieving acceleration purely through the model itself
Read More »Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

Self-Speculative Decoding Implementation: LayerSkip Transformer

Introduction

Self-Speculative Decoding is a variant of Speculative Decoding. The original Speculative Decoding method uses a draft model to optimize the inference of the target model. The draft model, which is typically distilled from the target model, offers similar output quality but with several times faster inference speed.

Read More »Self-Speculative Decoding Implementation: LayerSkip Transformer

Speculative Decoding Implementation Note (with Simple Experimental Results)

Introduction

Speculative Decoding is an extremely practical inference acceleration technique that enables a small model (draft model) to rapidly decode multiple tokens and retain the probability distribution of this process. Then, the larger target model, which we aim to accelerate, predicts the next token based on this draft. For each token position, the draft model’s probability distributions are computed and validated using the target model's probabilities, accepting the tokens decoded by the draft model if they are deemed sufficiently reliable.

Read More »Speculative Decoding Implementation Note (with Simple Experimental Results)

A Note Of Large Language Model Decode Sampling

When we use large language models for generative tasks, particularly in auto-regressive tasks, the model essentially performs a massive classification task. The classification targets are the tokens in our vocabulary, which are the smallest building blocks that make up words.

If we want to use greedy decoding, we can simply take the maximum value of the logits in the final layer of the model's decoding layer. However, if we want to introduce diversity and some level of randomness in the model's output, we have several parameters we can adjust to turn the logits into a probability distribution.

Read More »A Note Of Large Language Model Decode Sampling

Notes on Fine-Tuning a Multi-Modal Large Language Model Using SFTTrainer (Taking LLaVa-1.5 as an Example)

A multi-modal large language model (Multi-Modal Large Language Model) isn’t limited to text only. I know this might sound contradictory, but this is a term that has become widely accepted. What I want to document today is how to fine-tune a multi-modal model using a script.

Read More »Notes on Fine-Tuning a Multi-Modal Large Language Model Using SFTTrainer (Taking LLaVa-1.5 as an Example)

OpenAI Triton Note (1): Vector Addition

Introduction

Triton is an open-source GPU programming language compiler released by OpenAI in 2021. Over recent years, it has become increasingly popular among developers for writing and optimizing parallel programs on GPUs. Compared to traditional libraries such as CUDA or OpenCL, Triton offers a Python-like syntax, making it more readable and easier to learn.

Read More »OpenAI Triton Note (1): Vector Addition

[PyTorch] BERT Architecture Implementation Note

Introduction

My advisor used to tell me, “Don't just use other people's libraries; you have to write your own to truly understand.” Back then, I didn’t have much time to implement various technologies I was interested in since I was fully occupied with my dissertation. However, I often recall his earnest advice even now, and it prompted me to finally attempt the implementation of BERT, a classic encoder-only transformer model.

Read More »[PyTorch] BERT Architecture Implementation Note