PyTorch

Implementation Notes on Integrating Speculative Decoding with KV Cache

Clay
2025-07-01
AI, Machine Learning, PyTorch

Introduction

Speculative Decoding and KV Cache are both acceleration techniques applicable to Transformer models. The former uses a faster draft model to speculatively generate several subsequent tokens, which are then validated in a batch by the target model to reduce the cost of autoregressive decoding. The latter leverages the causal attention mechanism of Transformers—where past tokens do not attend to future tokens—to cache previously computed results and avoid redundant calculations during inference.
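To make the KV Cache half of this concrete, here is a minimal single-head sketch, not the implementation from the post. The weight matrices `Wq`, `Wk`, `Wv`, the `KVCache` class, and the helper names are all hypothetical, and the model is reduced to one attention layer with random weights. It shows that appending only the newest token's key and value to a cache gives the same attention output as recomputing every key and value from scratch.

```python
import torch

torch.manual_seed(0)
d = 8  # toy hidden size (assumption for illustration)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))


def attend(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention for a single query row."""
    scores = q @ K.T / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V


def last_token_no_cache(x: torch.Tensor) -> torch.Tensor:
    """Recompute K and V for the whole sequence, then attend from the last token."""
    K, V = x @ Wk, x @ Wv
    return attend(x[-1:] @ Wq, K, V)


class KVCache:
    """Keeps the keys and values of past tokens so each step only projects one new token."""

    def __init__(self) -> None:
        self.K = torch.empty(0, d)
        self.V = torch.empty(0, d)

    def step(self, x_new: torch.Tensor) -> torch.Tensor:
        # Past keys/values never change under causal attention, so we only append.
        self.K = torch.cat([self.K, x_new @ Wk])
        self.V = torch.cat([self.V, x_new @ Wv])
        return attend(x_new @ Wq, self.K, self.V)


x = torch.randn(5, d)  # embeddings of a 5-token sequence
cache = KVCache()
cached_outputs = [cache.step(x[t : t + 1]) for t in range(x.shape[0])]
```

When speculative decoding is layered on top, a cache like this also has to be truncated back to the last accepted position whenever draft tokens are rejected; that bookkeeping is left out of the sketch.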

Using The Target Model’s Confidence Threshold To Decide Whether To Enable Speculative Decoding

Clay
2024-11-22
AI, Machine Learning, PyTorch

Many of the inference acceleration techniques I have studied, such as Speculative Decoding, predominantly use a threshold for the confidence scores of the draft model. This threshold determines how many draft tokens should be decoded before passing them to the target model for verification, thereby reducing the extra computational cost when the draft model operates with low confidence.
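The draft-side threshold described above can be sketched roughly as follows. The function `draft_until_uncertain`, its parameters, and the scripted `toy_draft_step` are hypothetical names made up for illustration; a real draft model would return its top token together with that token's probability.

```python
from typing import Callable, List, Tuple


def draft_until_uncertain(
    draft_step: Callable[[List[int]], Tuple[int, float]],
    prefix: List[int],
    threshold: float = 0.6,
    max_draft_tokens: int = 8,
) -> List[int]:
    """Collect draft tokens until the draft model's confidence falls below `threshold`."""
    drafted: List[int] = []
    for _ in range(max_draft_tokens):
        token, confidence = draft_step(prefix + drafted)
        if confidence < threshold:
            # Stop drafting early and hand what we have to the target model.
            break
        drafted.append(token)
    return drafted


def toy_draft_step(tokens: List[int]) -> Tuple[int, float]:
    """Stand-in for a draft model: returns a scripted (token, confidence) pair."""
    scripted = [(11, 0.95), (12, 0.80), (13, 0.40), (14, 0.90)]
    return scripted[min(len(tokens), len(scripted) - 1)]


print(draft_until_uncertain(toy_draft_step, prefix=[]))
```

In this sketch the low-confidence token itself is dropped; some implementations keep it and stop afterwards, which is a design choice rather than something the excerpt specifies.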

Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

Clay
2024-11-19
AI, Machine Learning, Python, PyTorch

Over the past week, I dedicated some time to reproducing the Self-Speculative Decoding mechanism based on the ideas from the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, implementing the following modules:

A Decoder-only Transformer model with layer skipping (based on Llama and Gemma-2 architectures)
Adaptive Draft Exit Mechanism
Bayesian Optimization to discover the best layer-skipping strategy (optimizing draft model configurations)
Self-Speculative Decoding — achieving acceleration purely through the model itself

Self-Speculative Decoding Implementation: LayerSkip Transformer

Clay
2024-11-12
AI, Machine Learning, PyTorch

Introduction

Self-Speculative Decoding is a variant of Speculative Decoding. The original Speculative Decoding method uses a draft model to optimize the inference of the target model. The draft model, which is typically distilled from the target model, offers similar output quality but with several times faster inference speed.

Speculative Decoding Implementation Note (with Simple Experimental Results)

Clay
2024-11-09
Machine Learning, PyTorch

Introduction

Speculative Decoding is a practical inference acceleration technique in which a small model (the draft model) rapidly decodes several tokens and records the probability distribution at each step. The larger target model, which is the one we want to accelerate, then scores the whole draft in a single forward pass. At each token position, the draft model's probabilities are checked against the target model's probabilities, and a drafted token is accepted if it is deemed sufficiently reliable.
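One common way to make "sufficiently reliable" precise is the acceptance rule from the standard speculative sampling formulation: accept a drafted token x with probability min(1, p(x) / q(x)), where q is the draft distribution and p is the target distribution, and on the first rejection resample from the normalized residual max(0, p - q). The sketch below assumes that rule, which may differ from what the post implements; `verify_draft` and its arguments are hypothetical names, and the extra bonus token that is usually sampled when every draft token is accepted is left out.

```python
import random
from typing import List, Sequence


def verify_draft(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Sequence[float]],   # q: one distribution over the vocab per position
    target_probs: Sequence[Sequence[float]],  # p: one distribution over the vocab per position
    rng: random.Random,
) -> List[int]:
    """Return the tokens kept after verification (accepted prefix plus one correction)."""
    kept: List[int] = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept with probability min(1, p(x) / q(x)); q[token] > 0 because the
        # draft model sampled this token from q.
        if rng.random() < min(1.0, p[token] / q[token]):
            kept.append(token)
            continue
        # Rejected: resample from the residual distribution max(0, p - q),
        # then stop, since later draft tokens were conditioned on the rejected one.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        kept.append(rng.choices(range(len(p)), weights=residual, k=1)[0])
        break
    return kept


rng = random.Random(0)
same = [[0.5, 0.5], [0.5, 0.5]]
print(verify_draft([0, 1], same, same, rng))  # identical distributions: every draft token is accepted
```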

A Note Of Large Language Model Decode Sampling

Clay
2024-11-08
Machine Learning, PyTorch

When we use large language models for generative tasks, particularly in auto-regressive tasks, the model essentially performs a massive classification task. The classification targets are the tokens in our vocabulary, which are the smallest building blocks that make up words.

If we want to use greedy decoding, we can simply pick the token with the largest logit from the model's final output layer (the argmax). However, if we want to introduce diversity and some level of randomness into the model's output, there are several parameters we can adjust when turning the logits into a probability distribution.
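As a rough illustration of the difference, the sketch below contrasts greedy decoding with sampling under two commonly used parameters, temperature and top-k. It is written in plain Python on a list of logits so it stays self-contained; the function names are hypothetical, and the excerpt does not say which parameters the post covers.

```python
import math
import random
from typing import List, Optional, Sequence


def greedy(logits: Sequence[float]) -> int:
    """Greedy decoding: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def logits_to_probs(
    logits: Sequence[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
) -> List[float]:
    """Turn logits into probabilities with temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        if top_k < 1:
            raise ValueError("top_k must be at least 1")
        # Keep the k largest logits (ties at the cutoff are all kept).
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(
    logits: Sequence[float],
    rng: random.Random,
    temperature: float = 1.0,
    top_k: Optional[int] = None,
) -> int:
    """Draw one token index from the adjusted distribution."""
    probs = logits_to_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [1.0, 3.0, 2.0]
print(greedy(logits))
print(logits_to_probs(logits, temperature=0.5))
print(sample(logits, random.Random(0), temperature=1.0, top_k=2))
```

Lower temperatures concentrate probability on the largest logit, while top-k removes the long tail before sampling.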

Notes on Fine-Tuning a Multi-Modal Large Language Model Using SFTTrainer (Taking LLaVa-1.5 as an Example)

Clay
2024-10-08
AI, Machine Learning, PyTorch

A Multi-Modal Large Language Model (MLLM) isn't limited to text only. I know this might sound contradictory for something called a "language model", but the term has become widely accepted. What I want to document today is how to fine-tune a multi-modal model using a script.

[PyTorch] Traversing Every Layer of a Neural Network in a Model

Clay
2024-09-10
Machine Learning, PyTorch

Introduction

Recently, due to some serendipitous events, I had a chance to modify the architecture of a model slightly. I took this opportunity to explore how to iterate and print the layers of neural networks in PyTorch.
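A minimal sketch of this kind of traversal, using a small made-up `nn.Sequential` model as a stand-in for the real architecture: `named_modules()` walks the module tree recursively (including the root, whose name is the empty string), while `named_children()` yields only the direct submodules.

```python
import torch.nn as nn

# Toy model for illustration only; any nn.Module can be traversed the same way.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Sequential(nn.Linear(8, 8), nn.ReLU()),
    nn.Linear(8, 2),
)

# Recursive walk over every module, nested ones included.
for name, module in model.named_modules():
    print(repr(name), type(module).__name__)

# Only the top-level submodules.
direct_children = [name for name, _ in model.named_children()]

# Filter the recursive walk by layer type, e.g. to find every Linear layer.
linear_layers = [name for name, m in model.named_modules() if isinstance(m, nn.Linear)]
```

The dotted names produced by `named_modules()` (such as `2.0`) can be passed to `model.get_submodule(name)` to fetch a specific layer when modifying an architecture.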

OpenAI Triton Note (2): Fused Softmax

Clay
2024-09-09
Machine Learning, PyTorch

Introduction

Softmax is a commonly used activation function, and it is often employed as the last layer in multi-class classification.
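For reference, a row-wise softmax can be written in plain PyTorch as below. This is only a baseline sketch of the computation that a fused kernel would perform in a single pass, not the Triton kernel from the post; `naive_softmax` is a hypothetical name. Subtracting the row maximum before exponentiating keeps large logits from overflowing.

```python
import torch


def naive_softmax(x: torch.Tensor) -> torch.Tensor:
    """Row-wise softmax over a 2D tensor, computed step by step."""
    row_max = x.max(dim=1, keepdim=True).values      # for numerical stability
    numerator = torch.exp(x - row_max)
    denominator = numerator.sum(dim=1, keepdim=True)
    return numerator / denominator


x = torch.tensor([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
out = naive_softmax(x)
print(out)
```

Each step here reads and writes a full intermediate tensor, and that memory traffic is what fusing the steps into one kernel aims to avoid.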

OpenAI Triton Note (1): Vector Addition

Clay
2024-09-08
Machine Learning, PyTorch

Introduction

Triton is an open-source GPU programming language and compiler released by OpenAI in 2021. Over recent years, it has become increasingly popular among developers for writing and optimizing parallel programs on GPUs. Compared to traditional GPU programming platforms such as CUDA or OpenCL, Triton offers a Python-like syntax, making it more readable and easier to learn.
