Clay

Speculative Decoding Implementation Note (with Simple Experimental Results)

Clay
2024-11-09
Machine Learning, PyTorch

Introduction

Speculative Decoding is a practical inference acceleration technique. A small model (the draft model) rapidly decodes several tokens and keeps the probability distribution it used at each step. The larger target model, which is the one we aim to accelerate, then predicts on top of this draft. At each token position, the draft model’s probability is checked against the target model’s probability, and the drafted token is accepted if it is deemed sufficiently reliable.
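The acceptance check described above can be sketched in plain Python. This is a minimal illustration, not the post's implementation: it assumes the acceptance rule from the "Fast Inference from Transformers via Speculative Decoding" paper (accept a drafted token with probability min(1, p/q), otherwise resample from the renormalized residual max(0, p - q)). The function name `verify_draft` and the dict-based toy distributions are hypothetical stand-ins for real model outputs.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Return the accepted draft prefix, plus one corrected token on rejection.

    draft_probs[i] and target_probs[i] are dicts mapping token -> probability
    at drafted position i (hypothetical toy inputs, each summing to 1).
    """
    accepted = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept with probability min(1, p(token) / q(token)).
        accept_prob = min(1.0, p.get(token, 0.0) / q[token])
        if rng.random() < accept_prob:
            accepted.append(token)
            continue
        # Rejected: resample from the residual max(0, p - q), renormalized,
        # then stop, because later drafted tokens depended on the rejected one.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        total = sum(residual.values())
        tokens = list(residual)
        weights = [residual[t] / total for t in tokens]
        accepted.append(rng.choices(tokens, weights=weights, k=1)[0])
        break
    return accepted


# The target agrees on "a" but gives zero probability to the drafted "b".
result = verify_draft(
    ["a", "b"],
    [{"a": 1.0}, {"b": 1.0}],
    [{"a": 1.0}, {"b": 0.0, "c": 1.0}],
)
```

In this toy case the first token is always accepted and the second is always rejected and replaced, so `result` is `["a", "c"]`.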

A Note Of Large Language Model Decode Sampling

Clay
2024-11-08
Machine Learning, PyTorch

When we use large language models for generative tasks, particularly in auto-regressive tasks, the model essentially performs a massive classification task. The classification targets are the tokens in our vocabulary, which are the smallest building blocks that make up words.

If we want to use greedy decoding, we can simply pick the token with the largest logit from the model’s final output layer. However, if we want to introduce diversity and some level of randomness in the model’s output, we have several parameters we can adjust to turn the logits into a probability distribution.
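As a rough sketch of the difference, the snippet below contrasts greedy decoding with sampling from a softmax distribution. Temperature is used here as one example of such a parameter; that choice is an assumption, since the excerpt does not name the parameters. The function names are hypothetical, and the logits are a toy list rather than real model output.

```python
import math
import random


def greedy_decode(logits):
    """Return the index of the largest logit (the argmax token id)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token id at random according to the softmax probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [1.0, 3.0, 2.0]
greedy_id = greedy_decode(toy_logits)             # always index 1
sampled_id = sample_token(toy_logits, temperature=0.7)  # varies between runs
```

Greedy decoding is deterministic, while `sample_token` can return any index, with higher-logit tokens being more likely.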

[Paper Reading] Fast Inference from Transformers via Speculative Decoding

Clay
2024-11-06
AI, Machine Learning, Papers

Abstract

In auto-regressive model decoding, if we need to decode K tokens, we must go through the process K times, which is the current bottleneck in the inference time of large language models.

[Python] FastAPI Using Server-Sent Events (SSE) for Streaming Responses

Clay
2024-11-02
Python

I have recently set up numerous backend API servers for chatbots. Initially, I received user messages and returned the entire LLM-generated reply to the frontend in one go. However, this approach did not provide a good user experience. I then switched to HTTP streaming, sending each generated token to the frontend as it was produced. Later, I found that some users’ devices experienced packet sticking (several chunks arriving merged into one), so I finally switched to using WebSocket.

KV Cache: A Caching Mechanism To Accelerate Transformer Generation

Clay
2024-11-01
AI, Machine Learning

During decoding in large language models, especially auto-regressive models, tokens must be generated step by step until the entire sequence is complete. Caching techniques can reduce the computation in this process and improve decoding speed; one such technique is known as the KV Cache.

Using Finite State Machine (FSM) and Rollback Mechanism to Restrict LLM from Generating Banned Words

Clay
2024-10-29
AI, Machine Learning

When implementing various services through LLMs, do you worry about uncontrolled language generation? Recently, at a critical juncture in wrapping up a project, I used tools like Outlines to constrain LLM decoding, which effectively controlled the model’s output to follow the desired patterns. However, a colleague posed a deep question: What if I want it not to generate specific words?

[Python] Array Bisection Algorithm bisect Note

Clay
2024-10-27
Python

bisect is a built-in Python module, primarily designed to maintain the order of a sorted list, allowing items to be inserted without the need to re-sort the entire list.
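A short example of that behavior, using only the standard-library functions `bisect.insort` and `bisect.bisect_left` (the list contents are made up for illustration):

```python
import bisect

scores = [10, 20, 30, 40]   # must already be sorted

# Insert 25 at the position that keeps the list sorted.
bisect.insort(scores, 25)   # scores is now [10, 20, 25, 30, 40]

# Find the index where 30 sits (or would be inserted) without modifying the list.
position = bisect.bisect_left(scores, 30)   # 3
```

The lookup is a binary search, so it takes O(log n) comparisons, although the insertion itself still shifts list elements.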

Note Of Hydra: Environment Configure Manager Package

Clay
2024-10-26
Linux, Python

Introduction

Hydra is an open-source Python framework designed to simplify the research and deployment process, especially for complex applications. Hydra dynamically composes a hierarchical configuration from config files and allows these settings to be overridden from the command line.

[Solved] Unable to View Folder with Arrow Icon in GitHub Project

Clay
2024-10-25
Git, Github

Problem Description

Today, while developing a web application with React.js for the frontend and Python Flask for the backend, I pushed the project to my GitHub repository after reaching a satisfactory milestone. However, upon checking the repository, I was surprised to find that I couldn’t access the folder my-app created by npx create-react-app my-app.

Note on Calculating VRAM Consumption for Training and Inference of AI Models

Clay
2024-10-24
AI, Machine Learning

I’ve always used rough formulas to estimate the relationship between the scale of my models and the GPU VRAM consumption; after all, there are too many variables involved—model architecture, number of layers, attention mechanism implementation, sequence length, batch size, data precision used in training or inference… all of these affect our final calculation results.
