Skip to content

Clay

KV Cache: A Caching Mechanism To Accelerate Transformer Generation

During the decoding process of large language models, especially in Auto-regressive models, decoding must be performed step-by-step until the entire sequence is generated. Within this process, there are caching techniques that can help reduce computation and improve decoding speed; one such technique is known as the KV Cache.

Read More »KV Cache: A Caching Mechanism To Accelerate Transformer Generation

Using Finite State Machine (FSM) and Rollback Mechanism to Restrict LLM from Generating Banned Words

When implementing various services through LLMs, do you worry about uncontrolled language generation? Recently, at a critical juncture in wrapping up a project, I used tools like Outlines to constrain LLM decoding, which effectively controlled the model's output to follow the desired patterns. However, a colleague posed a deep question: What if I want it not to generate specific words?

Read More »Using Finite State Machine (FSM) and Rollback Mechanism to Restrict LLM from Generating Banned Words

[Solved] Unable to View Folder with Arrow Icon in GitHub Project

Problem Description

Today, while developing a web application with React.js for the frontend and Python Flask for the backend, I pushed the project to my GitHub repository after reaching a satisfactory milestone. However, upon checking the repository, I was surprised to find that I couldn’t access the folder my-app created by npx create-react-app my-app.

Read More »[Solved] Unable to View Folder with Arrow Icon in GitHub Project

Note on Calculating VRAM Consumption for Training and Inference of AI Models

I've always used rough formulas to estimate the relationship between the scale of my models and the GPU VRAM consumption; after all, there are too many variables involved—model architecture, number of layers, attention mechanism implementation, sequence length, batch size, data precision used in training or inference... all of these affect our final calculation results.

Read More »Note on Calculating VRAM Consumption for Training and Inference of AI Models

Here’s a thought: Will Transformers be replaced in the future?

Today, while I was eating, I came across a video (the video is attached at the end of this article). Unlike many tech channels that jump straight into discussing AI, economics, and replacing humans, this video took a more careful approach. It explained in detail how hardware specifications have influenced algorithms (or AI model architectures) over time.

Read More »Here’s a thought: Will Transformers be replaced in the future?

Note Of KTOTrainer (Kahneman-Tversky Optimization Trainer)

I've been intermittently reading about a fine-tuning method called Kahneman-Tversky Optimization (KTO) from various sources like HuggingFace's official documents and other online materials. It's similar to DPO as a way to align models with human values, but KTO's data preparation format is much more convenient, so I'm quickly applying it to my current tasks before making time to study the detailed content in the related papers.

Read More »Note Of KTOTrainer (Kahneman-Tversky Optimization Trainer)