
Clay

Optimizing LayerSkip Models with Bayesian Search for an Effective Layer Skipping Strategy

In self-speculative decoding, the draft model is derived from part of the target model’s network, so finding a good 'Layer Skip Strategy' is crucial. We need to skip enough layers to achieve a meaningful speedup while keeping the draft model’s predictions accurate enough that the target model does not reject them too often.

Today’s implementation uses Optuna, a Bayesian optimization framework, to tune my previously implemented LayerSkip model and determine which layers to skip.


Self-Speculative Decoding Implementation: LayerSkip Transformer

Introduction

Self-Speculative Decoding is a variant of Speculative Decoding. The original Speculative Decoding method uses a draft model to optimize the inference of the target model. The draft model, which is typically distilled from the target model, offers similar output quality but with several times faster inference speed.


A Note of Bayes' Theorem

Introduction

Recently, I've been organizing the papers on inference acceleration techniques that I've read over the past year into notes. During this process, I came across Bayesian optimization techniques that build on Bayes' theorem, so I decided to write a note recording the essence of the theorem.

In simple terms, Bayes' theorem is a frequently encountered theorem in probability theory that describes the probability of a random event occurring under specific conditions.
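For reference, the theorem in its usual form is written out below. The symbols $A$ and $B$ are generic event labels chosen here for illustration, not notation taken from the note itself.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
```

Here $P(A)$ is the prior, $P(B \mid A)$ is the likelihood, and $P(A \mid B)$ is the posterior, i.e. the probability of $A$ once $B$ is known to have occurred.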


Speculative Decoding Implementation Note (with Simple Experimental Results)

Introduction

Speculative Decoding is an extremely practical inference acceleration technique in which a small model (the draft model) rapidly decodes multiple tokens while retaining the probability distribution at each step. The larger target model, which we aim to accelerate, then evaluates this draft. At each token position, the draft model’s probability distribution is checked against the target model's probabilities, and the draft token is accepted if it is deemed sufficiently reliable.
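The acceptance step described above can be sketched roughly as follows. This is a minimal illustration that assumes the acceptance rule commonly used in the speculative sampling literature (accept a draft token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q). The function name `verify_draft` and its arguments are hypothetical, and the excerpt does not confirm that the post uses this exact rule.

```python
import random

def verify_draft(draft_tokens, q_dists, p_dists, rng=random):
    """Check drafted tokens against the target model's probabilities.

    draft_tokens: token ids proposed by the draft model.
    q_dists[i]:   draft model's probability list over the vocabulary at position i.
    p_dists[i]:   target model's probability list over the vocabulary at position i.
    Returns the accepted prefix, plus one corrected token if a rejection occurs.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        # Assumed rule: accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the positive part of (p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        weights = residual if sum(residual) > 0 else p
        accepted.append(rng.choices(range(len(p)), weights=weights)[0])
        break
    return accepted

# Identical distributions: every draft token is accepted.
same = [[0.5, 0.5], [0.5, 0.5]]
all_accepted = verify_draft([0, 1], same, same)

# Target assigns zero probability to the draft token: it is always replaced.
corrected = verify_draft([0], [[0.9, 0.1]], [[0.0, 1.0]])
```

The sketch leaves out the extra token that is usually sampled from the target model when every draft token is accepted, to keep the acceptance logic itself in focus.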


A Note Of Large Language Model Decode Sampling

When we use large language models for generative tasks, particularly in auto-regressive tasks, the model essentially performs a massive classification task. The classification targets are the tokens in our vocabulary, which are the smallest building blocks that make up words.

If we want to use greedy decoding, we can simply pick the token with the largest logit from the model's final output layer. However, if we want to introduce diversity and some randomness into the model's output, there are several parameters we can adjust when turning the logits into a probability distribution.
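As a rough illustration of the difference between greedy decoding and sampling, here is a minimal sketch. It assumes temperature is one of the adjustable parameters (the excerpt does not list them), and the function names are made up for this example.

```python
import math
import random

def greedy(logits):
    """Greedy decoding: return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0, rng=random):
    """Draw one token id at random according to the softmax probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [1.0, 3.0, 2.0]
greedy_token = greedy(logits)
probs_default = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.5)
sampled_token = sample(logits, temperature=0.8)
```

With a temperature below 1, more probability mass moves onto the highest-logit token, so sampling behaves more like greedy decoding. Above 1, the distribution flattens and outputs become more varied.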


[Python] FastAPI Using Server-Sent Events (SSE) for Streaming Responses

I have recently set up numerous backend API servers for chatbots. Initially, I received user messages and returned the entire LLM-generated reply to the frontend in one go. However, this approach did not provide a good user experience. I then switched to HTTP streaming, sending each generated token to the frontend as it was produced. Later, I found that some users' devices experienced "packet sticking" (several chunks arriving merged together), so I finally switched to using WebSocket.


KV Cache: A Caching Mechanism To Accelerate Transformer Generation

When large language models decode, especially auto-regressive ones, tokens must be generated step by step until the entire sequence is complete. Caching techniques can reduce the computation in this process and improve decoding speed; one such technique is known as the KV Cache.
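To make the idea concrete, here is a toy sketch of single-head attention with a key/value cache. It assumes, purely for illustration, that each token's query, key, and value are the token vector itself (no learned projections), and the names `attend` and `KVCache` are hypothetical rather than taken from the post.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w / norm * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCache:
    """Stores the keys and values of earlier tokens so they are not recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest token's key and value are added at each decoding step.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy "token" vectors; query = key = value = the vector itself (an assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [cache.step(x, x, x) for x in tokens]

# Reference without a cache: rebuild the whole key/value prefix at every step.
full_outputs = [attend(tokens[t], tokens[: t + 1], tokens[: t + 1])
                for t in range(len(tokens))]
```

Both paths produce the same attention outputs. The cached version simply avoids rebuilding keys and values for tokens that have already been processed, which is where the saving comes from in a real model with learned projections.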


Using Finite State Machine (FSM) and Rollback Mechanism to Restrict LLM from Generating Banned Words

When implementing various services through LLMs, do you worry about uncontrolled language generation? Recently, at a critical juncture in wrapping up a project, I used tools like Outlines to constrain LLM decoding, which effectively controlled the model's output to follow the desired patterns. However, a colleague posed a deep question: What if I want it not to generate specific words?
