Machine Learning

Supporting Hydra Speculative Decoding on TensorRT-LLM Python Session

Clay
2025-07-01
AI, Machine Learning, Python

Introduction

I’ve previously studied many different speculative decoding acceleration techniques and attempted to implement several architectures using PyTorch, including model architecture, training, and inference scripts (fast-llm-inference). This time, of course, I have a new goal.

[Paper Reading] Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Clay
2025-07-01
AI, Machine Learning, Papers

Currently, most of the time spent on LLM inference goes to generating tokens one at a time. The bottleneck is GPU memory bandwidth rather than compute: every decoded token requires loading all of the model’s weights, even though the floating-point work per token is small. As a result, the GPU’s computational capacity is underutilized.

Personal Interpretation of Cogito Trained with Iterated Distillation and Amplification (IDA)

Clay
2025-07-01
AI, Machine Learning

Cogito V1 is a model I recently came across on Reddit that demonstrated impressive performance. My colleagues had also recommended it just a day earlier. I decided to try it out on a RAG task I was working on, and the results were quite astonishing: most notably, it did not hallucinate when relevant reference materials were retrieved, and it synthesized information from multiple sources effectively. Among the models I’ve tested, only Gemma-3 gave me a similar experience without requiring fine-tuning.

[Paper Reading] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Clay
2025-07-01
AI, Machine Learning, Papers

I’ve continued digging into inference acceleration techniques lately, but work has kept me too busy to publish any updates. Today, I’m introducing a classic multi-head decoding architecture called Medusa.

Implementation Notes on Integrating Speculative Decoding with KV Cache

Clay
2025-07-01
AI, Machine Learning, PyTorch

Introduction

Speculative Decoding and KV Cache are both acceleration techniques applicable to Transformer models. The former uses a faster draft model to speculatively generate several subsequent tokens, which the target model then validates in a single batch to reduce the cost of autoregressive decoding. The latter relies on the causal attention mechanism of Transformers, in which past tokens do not attend to future tokens, to cache previously computed results and avoid redundant calculations during inference.
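The draft-then-verify loop described above can be illustrated with a minimal sketch of one greedy speculative step. It is not taken from the post: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and the two "models" are toy functions over integer tokens. A real implementation would verify all drafted positions in one batched forward pass of the target model and reuse its KV cache instead of calling the target once per position.

```python
from typing import Callable, List

# Hypothetical type: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one greedy speculative step and return the newly accepted tokens."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2) The target model checks each proposed token in order.
    #    (Assumption: done position by position here for clarity; a real
    #    system scores all positions in a single batched pass.)
    accepted: List[int] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    # Toy target: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Toy draft: agrees with the target except right after token 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


partial_accept = speculative_step([1], toy_draft, toy_target, k=4)
full_accept = speculative_step([5], toy_draft, toy_target, k=3)
```

In this sketch the output always matches what the target model alone would produce under greedy decoding; the draft only changes how many target calls are needed per accepted token.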

[Paper Reading] s1: Simple test-time scaling

Clay
2025-07-01
AI, Machine Learning, Papers

S1 Core Contributions

Test-Time Scaling has become a popular approach for enhancing LLM performance. The idea is to let the model “think” and organize its thoughts before providing an answer, resulting in improved accuracy.

Thoughts on LayerNorm (Theory)

Clay
2025-07-01
Machine Learning

I previously attempted to implement LayerNorm while reading through model architecture source code ([Machine Learning] Note of LayerNorm). However, that implementation merely followed the formula mechanically. Recently, while revisiting architectural design, I developed a deeper understanding of LayerNorm, so I am recording my thoughts here.

Accelerating vLLM with Arctic Inference and Custom Speculators

Clay
2025-05-10
AI, Machine Learning

Kangaroo: Inference Acceleration Architecture Implementation

Clay
2024-12-10
AI, Machine Learning

Introduction

Kangaroo is an implementation of Self-Speculative Decoding that introduces a trainable adapter layer. Over the past few weeks, I have been working on fine-tuning its adapter layer and have achieved some preliminary results, which I am documenting here.

Differences and Comparison Between KL Divergence and Cross Entropy

Clay
2024-12-03
Machine Learning

Introduction

Recently, while implementing the paper Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, I encountered a question about its use of Cross Entropy Loss to align the probability distributions of the draft model and the target model. Why not use KL Divergence instead?
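As background for that question, a small numeric sketch of how the two quantities relate may help. It is not from the post or the Kangaroo paper: the function names and the example distributions `p` (standing in for the target model) and `q` (standing in for the draft model) are assumptions made up for illustration. It checks the standard identity that cross entropy equals the entropy of `p` plus the KL divergence from `p` to `q`.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    # H(p) = -sum_i p_i * log(p_i)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    # H(p, q) = -sum_i p_i * log(q_i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up distributions over three tokens (assumption, for illustration only).
p = [0.7, 0.2, 0.1]  # stands in for the target model
q = [0.5, 0.3, 0.2]  # stands in for the draft model

# Identity: H(p, q) = H(p) + KL(p || q)
gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
```

Because `entropy(p)` does not depend on `q`, the two losses differ only by a constant whenever the target distribution is held fixed, so they yield the same gradients with respect to the draft distribution in that setting.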
