

[Paper Reading] Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Currently, most of the time spent during LLM inference is bottlenecked by the need to generate tokens sequentially. This exposes a limitation imposed by GPU memory bandwidth: for every single token decoded, the model’s entire set of weights must be loaded, even though the actual floating-point computation is minimal. As a result, the GPU’s computational capabilities are underutilized.
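To see why decoding is bandwidth-bound rather than compute-bound, a back-of-the-envelope estimate helps. All numbers below are illustrative assumptions (a 7B-parameter fp16 model, roughly 1 TB/s of memory bandwidth), not measurements of any particular GPU:

```python
# Rough ceiling on autoregressive decoding throughput imposed by memory
# bandwidth: each decoded token must stream the full model weights at
# least once, so tokens/s <= bandwidth / model size.

weights_bytes = 7e9 * 2          # assumed 7B-parameter model in fp16 (2 bytes/param)
bandwidth_bytes_per_s = 1e12     # assumed ~1 TB/s GPU memory bandwidth

max_tokens_per_s = bandwidth_bytes_per_s / weights_bytes
print(round(max_tokens_per_s, 1))  # -> 71.4
```

Even before any arithmetic is done, the weight traffic alone caps single-stream decoding at around 70 tokens/s under these assumptions, which is why techniques that amortize a weight load over several tokens are attractive.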


Personal Interpretation of Cogito Trained with Iterated Distillation and Amplification (IDA)

Cogito V1 is a model I recently came across on Reddit that demonstrated impressive performance. It was also recommended by my colleagues just a day earlier. I decided to try it out on a RAG task I was working on, and the results were quite astonishing — most notably, it refrained from hallucinations when relevant reference materials were retrieved and was able to effectively synthesize information from multiple sources. Among the models I’ve tested, only Gemma-3 gave me a similar experience without requiring fine-tuning.


Implementation Notes on Integrating Speculative Decoding with KV Cache

Introduction

Speculative Decoding and KV Cache are both acceleration techniques applicable to Transformer models. The former uses a faster draft model to speculatively generate several subsequent tokens, which are then validated in a batch by the target model to reduce the cost of autoregressive decoding. The latter leverages the causal attention mechanism of Transformers—where past tokens do not attend to future tokens—to cache previously computed results and avoid redundant calculations during inference.
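The draft-then-verify loop can be sketched without any real Transformer. Below, `draft` and `target` are stand-in functions over a tiny integer vocabulary (their rules are invented for illustration), and the batched verification pass is replayed position by position; in a real system that pass is a single forward call reusing the target model's KV cache:

```python
def draft(context):
    # Cheap stand-in draft model: next token = last token + 1 (mod 10).
    return (context[-1] + 1) % 10

def target(context):
    # Stand-in target model: same rule, except it maps a predicted 6 to 0,
    # so the two models occasionally disagree.
    nxt = (context[-1] + 1) % 10
    return 0 if nxt == 6 else nxt

def speculative_step(context, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) Verify the proposals with the target model, accepting the longest
    #    matching prefix; on the first mismatch, keep the target's token.
    accepted, ctx = [], list(context)
    for t in proposed:
        expected = target(ctx)
        if t != expected:
            accepted.append(expected)  # target's correction replaces the miss
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # All drafts accepted: the verification pass yields one bonus token.
        accepted.append(target(ctx))
    return accepted

print(speculative_step([1], k=4))  # -> [2, 3, 4, 5, 0]
print(speculative_step([4], k=4))  # -> [5, 0]
```

The first call accepts all four drafts plus a bonus token; the second rejects at position two, showing how a single step still always emits at least one target-approved token.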


Using The Target Model’s Confidence Threshold To Decide Whether To Enable Speculative Decoding

Many of the inference acceleration techniques I have studied, such as Speculative Decoding, predominantly apply a threshold to the draft model’s confidence scores. This threshold determines how many draft tokens should be decoded before they are passed to the target model for verification, thereby reducing the extra computational cost when the draft model operates with low confidence.
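The draft-side gating described above can be sketched as a loop that stops drafting as soon as confidence dips below the threshold. `draft_step`, its token rule, and the confidence-decay schedule below are all invented stand-ins for a real draft model:

```python
def draft_step(context):
    # Stand-in draft model: returns (token, confidence); confidence is
    # modeled as decaying with each additional speculative position.
    token = (context[-1] + 1) % 10
    confidence = 0.9 ** len(context)
    return token, confidence

def draft_tokens(context, threshold=0.5, max_draft=8):
    """Draft tokens until the draft model's confidence drops below threshold."""
    drafted, ctx = [], list(context)
    for _ in range(max_draft):
        token, conf = draft_step(ctx)
        if conf < threshold:
            break  # low confidence: stop drafting and hand off for verification
        drafted.append(token)
        ctx.append(token)
    return drafted
```

With these toy numbers, a starting context of one token drafts six tokens before confidence falls under 0.5; raising the threshold shortens (or empties) the draft, which is exactly the cost-control knob the paragraph describes.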


Using the `assistant_model` method in HuggingFace’s `transformers` library to accelerate Speculative Decoding

Recently, I attempted to implement various speculative decoding acceleration methods. HuggingFace’s `transformers` library also provides a corresponding acceleration feature through its `assistant_model` parameter. Today, let me take this opportunity to document it.
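In outline, the feature is used by passing a smaller draft model to `generate()` via the `assistant_model` argument. The sketch below uses `gpt2` / `gpt2-large` purely as an assumed example pair (any two compatible checkpoints sharing a tokenizer should work); it downloads weights, so treat it as illustrative rather than a ready-made benchmark:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed example checkpoints: a large target model and a small draft model
# from the same family, so their tokenizers agree.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-large")
draft = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
outputs = target.generate(
    **inputs,
    assistant_model=draft,  # enables assisted (speculative) generation
    max_new_tokens=32,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The output is identical in distribution to running `target.generate(...)` alone; the draft model only changes how many target forward passes are needed.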
