
Supporting Hydra Speculative Decoding on TensorRT-LLM Python Session


Last Updated on 2025-07-01 by Clay

Introduction

I’ve previously studied many different speculative decoding acceleration techniques and attempted to implement several architectures using PyTorch, including model architecture, training, and inference scripts (fast-llm-inference). This time, of course, I have a new goal.

Read More »Supporting Hydra Speculative Decoding on TensorRT-LLM Python Session

[Paper Reading] Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Last Updated on 2025-07-01 by Clay

Currently, most of the time spent during LLM inference is bottlenecked by the need to generate tokens sequentially. This exposes a limitation imposed by GPU memory bandwidth: for every single token decoded, the model's entire set of weights must be loaded from memory, even though the actual floating-point computation per token is minimal. As a result, the GPU's computational capability is underutilized.
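This bottleneck can be made concrete with a back-of-the-envelope calculation: if every decoded token requires streaming all weights through memory once, then memory bandwidth divided by model size gives an upper bound on sequential decoding speed. The figures below (a 7B-parameter FP16 model, ~900 GB/s of bandwidth) are illustrative assumptions, not measurements:

```python
# Rough upper bound on sequential decoding throughput for a
# memory-bandwidth-limited GPU. Numbers are illustrative assumptions.
def max_tokens_per_second(num_params: float, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    model_bytes = num_params * bytes_per_param          # total weight bytes loaded per token
    return bandwidth_gb_s * 1e9 / model_bytes           # tokens/s if decoding is purely bandwidth-bound

# A 7B-parameter model stored in FP16 (2 bytes per parameter) is ~14 GB,
# so ~900 GB/s of bandwidth caps sequential decoding at roughly 64 tokens/s.
print(round(max_tokens_per_second(7e9, 2, 900), 1))
```

Real throughput is lower still once kernel launch overhead and attention over the KV cache are included, which is why techniques like Hydra try to amortize each weight load over several candidate tokens.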

Read More »[Paper Reading] Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Personal Interpretation of Cogito Trained with Iterated Distillation and Amplification (IDA)

Last Updated on 2025-07-01 by Clay

Cogito V1 is a model I recently came across on Reddit that demonstrated impressive performance; it had also been recommended by my colleagues just a day earlier. I decided to try it out on a RAG task I was working on, and the results were quite astonishing — most notably, it refrained from hallucinating when relevant reference materials were retrieved and was able to effectively synthesize information from multiple sources. Among the models I've tested, only Gemma-3 gave me a similar experience without requiring fine-tuning.

Read More »Personal Interpretation of Cogito Trained with Iterated Distillation and Amplification (IDA)

[Paper Reading] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Last Updated on 2025-07-01 by Clay

Recently, I’ve still been diving into inference acceleration techniques, but work has kept me too busy to publish any updates. Today, I’m introducing a classic multi-head decoding architecture called Medusa.

Read More »[Paper Reading] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Implementation Notes on Integrating Speculative Decoding with KV Cache

Last Updated on 2025-07-01 by Clay

Introduction

Speculative Decoding and KV Cache are both acceleration techniques applicable to Transformer models. The former uses a faster draft model to speculatively generate several subsequent tokens, which are then validated in a single batched forward pass by the target model, reducing the cost of autoregressive decoding. The latter exploits the causal attention mechanism of Transformers — past tokens never attend to future tokens — to cache the key and value tensors computed for earlier positions, avoiding redundant computation during inference.
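The draft-then-verify loop described above can be sketched in a few lines. The two "models" here are hypothetical toy functions standing in for a real draft and target model, so the focus stays on the accept/reject logic rather than on inference itself:

```python
import random

random.seed(0)

def draft_next(token: int) -> int:
    # Hypothetical cheap draft model: a fixed next-token rule.
    return (token * 3 + 1) % 10

def target_next(token: int) -> int:
    # Hypothetical slow target model; here it agrees with the draft ~70% of the time.
    return draft_next(token) if random.random() < 0.7 else (token + 1) % 10

def speculative_step(token: int, k: int = 4) -> list[int]:
    # 1) The draft model proposes k tokens autoregressively (cheap).
    drafts, t = [], token
    for _ in range(k):
        t = draft_next(t)
        drafts.append(t)
    # 2) The target model verifies the drafts (in a real system this is
    #    one batched forward pass). Accept the longest matching prefix,
    #    then emit the target's own token at the first mismatch.
    accepted, t = [], token
    for d in drafts:
        out = target_next(t)
        accepted.append(out)
        if out != d:
            break
        t = d
    return accepted

print(speculative_step(2))
```

Each call advances the sequence by between 1 and k tokens while charging the target model for only one (batched) verification pass, which is exactly where the speedup comes from; the integration work in the post is about making this verification reuse the target model's KV cache instead of recomputing attention from scratch.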

Read More »Implementation Notes on Integrating Speculative Decoding with KV Cache

Why Do We Forget What We Learn? Understanding the Forgetting Curve

Last Updated on 2025-05-06 by Clay

Preface

I’ve always tried to keep myself in a state of continuous learning. Yet, there are days when work gets hectic or friends invite me out, and by the time I get home, I’m too exhausted to study. I just play PS5 for a while, take a quick shower, and go to bed. While these days are relaxing and carefree, deep down I worry that if I don’t study regularly, I’ll begin to forget what I’ve learned — just like the saying goes: “Learning is like rowing upstream; not to advance is to drop back.”

Read More »Why Do We Forget What We Learn? Understanding the Forgetting Curve