November 2024

A Note Of Brewer’s/CAP Theorem

Clay
2024-11-27
Computer

Last Updated on 2024-11-27 by Clay

Recently, I have been reviewing notes on distributed systems to reflect on the systems I built over the past year and examine potential areas for improvement. Someone recommended studying the CAP theorem, and after reading it, I found it quite intuitive, so I decided to document it here.

Using The Target Model’s Confidence Threshold To Decide Whether To Enable Speculative Decoding

Clay
2024-11-22
AI, Machine Learning, PyTorch

Last Updated on 2024-11-22 by Clay

Many of the inference acceleration techniques I have studied, such as Speculative Decoding, predominantly use a threshold for the confidence scores of the draft model. This threshold determines how many draft tokens should be decoded before passing them to the target model for verification, thereby reducing the extra computational cost when the draft model operates with low confidence.
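The draft-side threshold described above can be sketched as a small loop. This is a minimal illustration, not the post's implementation: `draft_until_unsure`, `toy_draft_model`, the 0.6 threshold, and the cap of 8 draft tokens are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Sequence


def draft_until_unsure(
    next_token_probs: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    threshold: float = 0.6,      # assumed value, for illustration only
    max_draft_tokens: int = 8,   # assumed cap on the draft length
) -> List[int]:
    """Greedily draft tokens until the draft model's top probability falls below `threshold`."""
    drafted: List[int] = []
    context = list(prefix)
    for _ in range(max_draft_tokens):
        probs = next_token_probs(context)
        token = max(range(len(probs)), key=lambda i: probs[i])
        if probs[token] < threshold:
            break  # low confidence: stop drafting and hand over to the target model
        drafted.append(token)
        context.append(token)
    return drafted


def toy_draft_model(context: List[int]) -> Sequence[float]:
    # Hypothetical stand-in for a real draft model: confident on short contexts only.
    if len(context) < 3:
        return [0.1, 0.8, 0.1]
    return [0.4, 0.35, 0.25]


drafted = draft_until_unsure(toy_draft_model, prefix=[0])
print(drafted)
```

In this sketch the low-confidence token is dropped rather than appended to the draft; keeping it and letting the target model judge it is an equally reasonable choice.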

Using the `assistant_model` method in HuggingFace’s `transformers` library to accelerate Speculative Decoding

Clay
2024-11-20
AI, Machine Learning

Last Updated on 2024-11-20 by Clay

Recently, I attempted to implement various speculative decoding acceleration methods. HuggingFace’s transformers library also provides a corresponding acceleration feature called assistant_model. Today, let me take this opportunity to document it.

Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

Clay
2024-11-19
AI, Machine Learning, Python, PyTorch

Last Updated on 2024-11-19 by Clay

Over the past week, I dedicated some time to reproducing the Self-Speculative Decoding mechanism based on the ideas from the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, implementing the following modules:

A Decoder-only Transformer model with layer skipping (based on Llama and Gemma-2 architectures)
Adaptive Draft Exit Mechanism
Bayesian Optimization to discover the best layer-skipping strategy (optimizing draft model configurations)
Self-Speculative Decoding — achieving acceleration purely through the model itself

[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Clay
2024-11-16
AI, Machine Learning, Papers

Last Updated on 2024-11-16 by Clay

Highlights of This Paper

Quantization, pruning, and distillation can also accelerate models, but come with issues like changes in output distribution compared to the original model, as well as the cost of retraining.
The original Speculative Decoding faces the issue of requiring additional memory to run the draft model, whereas Self-Speculative Decoding uses part of its own neural network as the draft model.
The Adaptive Draft-Exiting Mechanism can automatically adjust the number of tokens predicted by the draft model based on confidence score thresholds.

Optimizing LayerSkip Models with Bayesian Search for an Effective Layer Skipping Strategy

Clay
2024-11-15
AI, Machine Learning

Last Updated on 2024-11-15 by Clay

In self-speculative decoding, since our draft model is derived from part of the target model’s network, finding an optimal ‘Layer Skip Strategy’ is crucial. We need to skip enough layers to achieve meaningful speedup while ensuring the draft model’s speculative decoding is good enough to avoid frequent rejection by the target model.

Today’s implementation focuses on optimizing my previously implemented LayerSkip model using the Bayesian optimization framework Optuna, to determine which layers to skip.

Self-Speculative Decoding Implementation: LayerSkip Transformer

Clay
2024-11-12
AI, Machine Learning, PyTorch

Last Updated on 2024-11-12 by Clay

Introduction

Self-Speculative Decoding is a variant of Speculative Decoding. The original Speculative Decoding method uses a draft model to speed up the inference of the target model. The draft model, typically distilled from the target model, offers similar output quality while running inference several times faster.

A Note of Bayes’ Theorem

Clay
2024-11-11
Math

Last Updated on 2024-11-11 by Clay

Introduction

Recently, I’ve been trying to organize the papers on inference acceleration techniques I’ve read over the past year into notes. During this process, I came across Bayesian optimization techniques that build on Bayes’ theorem, so I decided to write a note recording the essence of the theorem.

In simple terms, Bayes’ theorem is a frequently encountered result in probability theory that describes the probability of an event given that some other condition is known to hold, that is, a conditional probability.
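As a small worked example, Bayes’ theorem states P(A|B) = P(B|A) · P(A) / P(B). The sketch below applies it to a screening-test scenario; the function name `posterior` and all the numbers (1% prior, 90% sensitivity, 5% false-positive rate) are made up for illustration and do not come from the original note.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B).

    P(B) is expanded with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A).
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Illustrative numbers only: A = "has the condition", B = "test is positive".
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 4))  # 0.1538
```

Even with a fairly accurate test, the posterior stays near 15% because the prior is so small, which is the kind of update Bayes’ theorem makes explicit.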

Speculative Decoding Implementation Note (with Simple Experimental Results)

Clay
2024-11-09
Machine Learning, PyTorch

Last Updated on 2024-11-09 by Clay

Introduction

Speculative Decoding is an extremely practical inference acceleration technique. A small model (the draft model) rapidly decodes multiple tokens and retains the probability distribution it used at each step. The larger target model, which is the one we aim to accelerate, then scores this draft and predicts the next token. At each token position, the draft model’s probability distribution is checked against the target model’s probabilities, and the tokens decoded by the draft model are accepted if they are deemed sufficiently reliable.
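One common way to make "sufficiently reliable" precise is the rejection-sampling rule from the speculative sampling literature: accept a drafted token x with probability min(1, p(x) / q(x)), where q is the draft distribution and p the target distribution, and on rejection resample from the normalized residual max(0, p − q). The sketch below assumes that rule; the function name `verify_draft` and the toy distributions are hypothetical, and the extra "bonus" token usually sampled when every draft token is accepted is left out for brevity.

```python
import random
from typing import List, Sequence, Tuple


def verify_draft(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Sequence[float]],   # q_i(.) for each drafted position
    target_probs: Sequence[Sequence[float]],  # p_i(.) for each drafted position
    rng: random.Random,
) -> Tuple[List[int], bool]:
    """Return (tokens to keep, whether the whole draft was accepted)."""
    kept: List[int] = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept with probability min(1, p(x) / q(x)); q[token] > 0 because it was drafted.
        if rng.random() < min(1.0, p[token] / q[token]):
            kept.append(token)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop verifying.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        kept.append(rng.choices(range(len(residual)), weights=residual)[0])
        return kept, False
    return kept, True


rng = random.Random(0)
q = [[0.9, 0.1], [0.9, 0.1]]
same = verify_draft([0, 0], q, q, rng)    # identical distributions: always accepted
p = [[0.9, 0.1], [0.0, 1.0]]              # target gives token 0 zero mass at step 2
mixed = verify_draft([0, 0], q, p, rng)
print(same, mixed)
```

With this rule the kept tokens follow the target model's distribution, which is why the method is usually described as lossless.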

A Note Of Large Language Model Decode Sampling

Clay
2024-11-08
Machine Learning, PyTorch

Last Updated on 2024-11-08 by Clay

When we use large language models for generative tasks, particularly in auto-regressive tasks, the model essentially performs a massive classification task. The classification targets are the tokens in our vocabulary, which are the smallest building blocks that make up words.

If we want to use greedy decoding, we can simply pick the token with the largest logit (the argmax) from the model’s final output layer. However, if we want to introduce diversity and some level of randomness in the model’s output, there are several parameters we can adjust when turning the logits into a probability distribution.
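A minimal sketch of that pipeline is shown below, assuming the usual trio of temperature, top-k, and top-p (nucleus) filtering. The function names, the example logits, and the order of the filters are choices made for this illustration; real libraries differ in details such as tie handling.

```python
import math
import random
from typing import List, Optional, Sequence


def _renormalize(probs: Sequence[float]) -> List[float]:
    total = sum(probs)
    return [p / total for p in probs]


def logits_to_probs(
    logits: Sequence[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
) -> List[float]:
    """Temperature-scaled softmax, followed by optional top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    probs = _renormalize([math.exp(x - peak) for x in scaled])
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        allowed = set(order[:top_k])  # keep only the k most likely tokens
        probs = _renormalize([p if i in allowed else 0.0 for i, p in enumerate(probs)])
    if top_p is not None:
        allowed, cumulative = set(), 0.0
        for i in order:  # smallest prefix whose cumulative mass reaches top_p
            allowed.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        probs = _renormalize([p if i in allowed else 0.0 for i, p in enumerate(probs)])
    return probs


logits = [2.0, 1.0, 0.0]  # toy logits over a three-token vocabulary
greedy = max(range(len(logits)), key=lambda i: logits[i])
top2 = logits_to_probs(logits, top_k=2)
nucleus = logits_to_probs(logits, top_p=0.5)
sampled = random.Random(0).choices(range(len(logits)), weights=top2)[0]
```

Raising the temperature flattens the distribution, while top-k and top-p remove the unlikely tail before sampling.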
