Speculative Decoding Implementation Note (with Simple Experimental Results)

Clay
2024-11-09
Machine Learning, PyTorch

Last Updated on 2024-11-09 by Clay

Introduction

Speculative decoding is a practical inference acceleration technique in which a small model (the draft model) rapidly decodes several tokens and keeps the probability distribution it produced at each step. The larger target model, which is the one we want to accelerate, then predicts the next token based on this draft. At each token position, the draft model’s probabilities are checked against the target model’s probabilities, and a token decoded by the draft model is accepted if it is deemed sufficiently reliable.
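The acceptance check can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the rejection-sampling rule commonly used for speculative decoding (accept a drafted token with probability min(1, p/q), where q is the draft probability and p is the target probability); the excerpt does not spell out the exact rule, and the function names here are hypothetical.

```python
import random


def accept_draft_token(token, draft_probs, target_probs, rng):
    """Decide whether to keep one drafted token (assumed acceptance rule)."""
    q = draft_probs[token]   # probability the draft model gave this token
    p = target_probs[token]  # probability the target model gives it
    # Accept with probability min(1, p / q); q > 0 because the draft sampled it.
    return rng.random() < min(1.0, p / q)


def residual_distribution(draft_probs, target_probs):
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    total = sum(residual)
    if total == 0.0:
        # Both models agree exactly; fall back to the target distribution.
        return list(target_probs)
    return [r / total for r in residual]


draft = [0.5, 0.3, 0.2]
target = [0.2, 0.3, 0.5]
rng = random.Random(0)
kept = accept_draft_token(2, draft, target, rng)   # target likes token 2 more
fallback = residual_distribution(draft, target)    # used only on rejection
```

When the target model assigns a drafted token at least as much probability as the draft model did, the token is always kept; otherwise it is kept only some of the time, and a rejected position is resampled from the residual distribution.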

Clay
2024-11-08
Machine Learning, PyTorch

Last Updated on 2024-11-08 by Clay

When we use large language models for generative tasks, particularly in auto-regressive tasks, the model essentially performs a massive classification task. The classification targets are the tokens in our vocabulary, which are the smallest building blocks that make up words.

If we want to use greedy decoding, we can simply take the token with the highest logit in the output of the model’s final layer. However, if we want to introduce diversity and some randomness into the model’s output, there are several parameters we can adjust when turning the logits into a probability distribution.
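As a rough illustration, the sketch below contrasts greedy selection with a temperature-scaled softmax. Temperature is only one of the parameters such a post might cover, and the function names are hypothetical; the code uses plain Python lists rather than PyTorch tensors to stay self-contained.

```python
import math


def greedy_token(logits):
    """Greedy decoding: return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 2.0]
best = greedy_token(logits)
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

A temperature below 1 concentrates probability on the top token, while a temperature above 1 spreads it across more candidates.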

Clay
2024-11-06
AI, Machine Learning, Papers

Last Updated on 2024-11-06 by Clay

Abstract

In auto-regressive model decoding, if we need to decode K tokens, we must run the decoding process K times in sequence, which is currently the bottleneck in the inference time of large language models.

Clay
2024-11-02
Python

Last Updated on 2024-11-02 by Clay

I have recently set up numerous backend API servers for Chatbots. Initially, I received user messages and returned the entire LLM-generated reply in one go to the frontend interface. However, this approach did not provide a good user experience. I then switched to HTTP streaming, sending each generated token to the frontend as it was produced. Later, I found that some users’ devices experienced packet sticking, so I finally switched to using WebSocket.

Clay
2024-11-01
AI, Machine Learning

Last Updated on 2024-11-01 by Clay

During the decoding process of large language models, especially auto-regressive models, decoding must be performed step by step until the entire sequence is generated. Within this process, there are caching techniques that can reduce computation and improve decoding speed; one such technique is known as the KV Cache.
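The idea can be illustrated with a toy single-head attention in plain Python. This is a simplified sketch, not a real model: each token vector is used directly as its own query, key, and value (an assumption standing in for learned projections), and the names `KVCache`, `decode_with_cache`, and `decode_without_cache` are hypothetical.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


class KVCache:
    """Stores keys and values of earlier tokens so they are computed once."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest token's key and value are added at each step.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def decode_with_cache(token_vectors):
    cache = KVCache()
    return [cache.step(v, v, v) for v in token_vectors]


def decode_without_cache(token_vectors):
    outputs = []
    for t in range(len(token_vectors)):
        prefix = token_vectors[: t + 1]
        # Without a cache, keys and values for the whole prefix are rebuilt.
        keys = [list(v) for v in prefix]
        values = [list(v) for v in prefix]
        outputs.append(attend(prefix[-1], keys, values))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cached = decode_with_cache(tokens)
recomputed = decode_without_cache(tokens)
```

Both paths produce the same outputs; the cached path simply avoids rebuilding the keys and values of earlier tokens at every step, which is where the savings come from in a real model.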

Clay
2024-10-29
AI, Machine Learning

Last Updated on 2024-10-29 by Clay

When implementing various services through LLMs, do you worry about uncontrolled language generation? Recently, at a critical juncture in wrapping up a project, I used tools like Outlines to constrain LLM decoding, which effectively controlled the model’s output to follow the desired patterns. However, a colleague posed a deep question: What if I want it not to generate specific words?
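One common way to approach the colleague’s question is to mask the logits of unwanted tokens before choosing the next token. The sketch below is an assumption about how this could be done, not the method used in the post or by Outlines; `ban_tokens` and `pick_greedy` are hypothetical names, and real words often span several tokens, which this single-token mask does not handle.

```python
import math


def ban_tokens(logits, banned_token_ids):
    """Return a copy of the logits with banned token ids set to -inf."""
    banned = set(banned_token_ids)
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]


def pick_greedy(logits):
    """Pick the index of the largest remaining logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [0.1, 5.0, 2.0]
masked = ban_tokens(logits, banned_token_ids=[1])
next_token = pick_greedy(masked)  # token 1 can no longer be selected
```

Because a logit of negative infinity maps to zero probability under softmax, the same mask also works when sampling instead of decoding greedily.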

Clay
2024-10-27
Python

Last Updated on 2024-10-27 by Clay

bisect is a built-in Python module, primarily designed to maintain the order of a sorted list, allowing items to be inserted without the need to re-sort the entire list.
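A short example of the two most common calls, `bisect.insort` and `bisect.bisect_left`, using a made-up list of scores:

```python
import bisect

scores = [10, 20, 30, 40]

# Insert 25 at the position that keeps the list sorted; no re-sort needed.
bisect.insort(scores, 25)

# Find the index where 30 sits (or would be inserted) via binary search.
position = bisect.bisect_left(scores, 30)
```

After these calls, `scores` is `[10, 20, 25, 30, 40]` and `position` is `3`. The position lookup is a binary search, although the insertion itself still has to shift the list elements that follow.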

Clay
2024-10-26
Linux, Python

Last Updated on 2024-10-26 by Clay

Introduction

Hydra is an open-source Python framework designed to simplify the research and deployment process, especially for complex applications. Hydra can dynamically create a hierarchical configuration by composing configuration files, and it allows these settings to be overridden from the command line.

Clay
2024-10-25
Git, Github

Last Updated on 2024-10-25 by Clay

Problem Description

Today, while developing a web application with React.js for the frontend and Python Flask for the backend, I pushed the project to my GitHub repository after reaching a satisfactory milestone. However, upon checking the repository, I was surprised to find that I couldn’t access the folder my-app created by npx create-react-app my-app.

Clay
2024-10-24
AI, Machine Learning

Last Updated on 2024-10-24 by Clay

I’ve always used rough formulas to estimate the relationship between the scale of my models and the GPU VRAM consumption; after all, there are too many variables involved—model architecture, number of layers, attention mechanism implementation, sequence length, batch size, data precision used in training or inference… all of these affect our final calculation results.
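As one example of such a rough formula, the memory needed just to hold the model weights is approximately the parameter count multiplied by the bytes per parameter. The sketch below covers only that term; it is an illustrative assumption rather than the formula from the post, and it ignores activations, the KV Cache, optimizer states, and framework overhead.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def estimate_weight_memory_gib(num_params, dtype="fp16"):
    """Rough VRAM (in GiB) needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


# A hypothetical 7-billion-parameter model at two precisions.
half_precision = estimate_weight_memory_gib(7_000_000_000, "fp16")
full_precision = estimate_weight_memory_gib(7_000_000_000, "fp32")
```

For a 7-billion-parameter model this gives roughly 13 GiB in fp16 and twice that in fp32, which is a lower bound on what inference actually consumes.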
