Clay

Accelerating vLLM with Arctic Inference and Custom Speculators

Clay
2025-05-102025-05-10
AI, Machine Learning

[Data Structure] Min-Max Heap

Clay
2025-05-082025-05-08
Data Structure

Why Do We Forget What We Learn? Understanding the Forgetting Curve

Clay
2025-05-062025-05-06
Essay

Preface

I’ve always tried to keep myself in a state of continuous learning. Yet, there are days when work gets hectic or friends invite me out, and by the time I get home, I’m too exhausted to study. I just play PS5 for a while, take a quick shower, and go to bed. While these days are relaxing and carefree, deep down I worry that if I don’t study regularly, I’ll begin to forget what I’ve learned — just like the saying goes: "Learning is like rowing upstream; not to advance is to drop back."

Clay
2024-12-102024-12-10
AI, Machine Learning

Introduction

Kangaroo is an implementation of Self-Speculative Decoding that introduces a trainable adapter layer. Over the past few weeks, I have been working on fine-tuning its adapter layer and have achieved some preliminary results, which I am documenting here.

Clay
2024-12-032024-12-03
Machine Learning

Introduction

Recently, while implementing the paper Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, I encountered a question about its use of Cross Entropy Loss to align the probability distributions of the draft model and the target model. Why not use KL Divergence instead?

Clay
2024-11-272024-11-27
Computer

Recently, I have been reviewing notes on distributed systems to reflect on the systems I built over the past year and examine potential areas for improvement. Someone recommended studying the CAP theorem, and after reading it, I found it quite intuitive, so I decided to document it here.

Clay
2024-11-222024-11-22
AI, Machine Learning, PyTorch

Many of the inference acceleration techniques I have studied, such as Speculative Decoding, predominantly use a threshold for the confidence scores of the draft model. This threshold determines how many draft tokens should be decoded before passing them to the target model for verification, thereby reducing the extra computational cost when the draft model operates with low confidence.

Clay
2024-11-202024-11-20
AI, Machine Learning

Recently, I attempted to implement various speculative decoding acceleration methods. HuggingFace's transformers library also provides a corresponding acceleration feature called assistant_model. Today, let me take this opportunity to document it.

Clay
2024-11-192024-11-19
AI, Machine Learning, Python, PyTorch

Over the past week, I dedicated some time to reproducing the Self-Speculative Decoding mechanism based on the ideas from the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, implementing the following modules:

A Decoder-only Transformer model with layer skipping (based on Llama and Gemma-2 architectures)
Adaptive Draft Exit Mechanism
Bayesian Optimization to discover the best layer-skipping strategy (optimizing draft model configurations)
Self-Speculative Decoding — achieving acceleration purely through the model itself

Clay
2024-11-162024-11-16
AI, Machine Learning, Papers

Highlights of This Paper

Quantization, pruning, and distillation can also accelerate models, but come with issues like changes in output distribution compared to the original model, as well as the cost of retraining.
The original Speculative Decoding faces the issue of requiring additional memory to run the draft model, whereas Self-Speculative Decoding uses part of its own neural network as the draft model.
The Adaptive Draft-Exiting Mechanism can automatically adjust the number of tokens predicted by the draft model based on confidence score thresholds.

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Clay

Accelerating vLLM with Arctic Inference and Custom Speculators

[Data Structure] Min-Max Heap

Why Do We Forget What We Learn? Understanding the Forgetting Curve

Preface

Kangaroo: Inference Acceleration Architecture Implementation

Introduction

Differences and Comparison Between KL Divergence and Cross Entropy

Introduction

A Note Of Brewer's/CAP Theorem

Using The Target Model's Confidence Threshold To Decide Whether To Enable Speculative Decoding

Using the `assistant_model` method in HuggingFace's `transformers` library to accelerate Speculative Decoding

Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Highlights of This Paper