Python

Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

Clay
2024-11-19
AI, Machine Learning, Python, PyTorch

Over the past week, I spent some time reproducing the Self-Speculative Decoding mechanism from the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding. I implemented the following modules:

A Decoder-only Transformer model with layer skipping (based on Llama and Gemma-2 architectures)
Adaptive Draft Exit Mechanism
Bayesian Optimization to discover the best layer-skipping strategy (optimizing draft model configurations)
Self-Speculative Decoding — achieving acceleration purely through the model itself

Clay
2024-11-02
Python

I have recently set up numerous backend API servers for chatbots. Initially, I received the user's message and returned the entire LLM-generated reply to the frontend in one go, but this made for a poor user experience. I then switched to HTTP streaming, sending each token to the frontend as it was generated. Later, I found that some users' devices received several chunks stuck together in a single read (the so-called "sticky packet" problem), so I finally switched to WebSocket.

Clay
2024-10-29
AI, Machine Learning

When implementing various services through LLMs, do you worry about uncontrolled language generation? Recently, at a critical juncture in wrapping up a project, I used tools like Outlines to constrain LLM decoding, which effectively controlled the model's output to follow the desired patterns. However, a colleague posed a deep question: What if I want it not to generate specific words?

Clay
2024-10-27
Python

bisect is a module in Python's standard library, primarily designed to keep a sorted list in order: items can be inserted at the correct position without re-sorting the entire list.
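As a minimal sketch of that idea (the list name `scores` and its values are made up for illustration), `bisect.insort` places a new item at its sorted position, and `bisect.bisect_left` reports where a value would go:

```python
import bisect

# An already-sorted list (hypothetical example data).
scores = [10, 20, 30, 40]

# Insert 25 at the position that keeps the list sorted; no re-sort is needed.
bisect.insort(scores, 25)

# Find the leftmost index at which 30 could be inserted while preserving order.
position = bisect.bisect_left(scores, 30)

print(scores)    # [10, 20, 25, 30, 40]
print(position)  # 3
```

The lookup is a binary search, so finding the position takes O(log n) comparisons, although the insertion itself still shifts list elements.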

Clay
2024-10-26
Linux, Python

Introduction

Hydra is an open-source Python framework designed to simplify the research and deployment process, especially for complex applications. Hydra composes a hierarchical configuration dynamically at run time and allows its values to be overridden from the command line.

Clay
2024-10-13
Machine Learning, Python

What is KL Divergence?

In machine learning, we often encounter the term KL Divergence (also known as Kullback-Leibler Divergence). KL Divergence is a measure of how much a probability distribution P differs from a second distribution Q. It is not symmetric, so it is not a distance metric in the strict sense.
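For discrete distributions, the usual definition is the sum over outcomes of P(x) * log(P(x) / Q(x)). The sketch below is only an illustration under stated assumptions: the helper name `kl_divergence` and the example distributions `p` and `q` are hypothetical, the natural logarithm is used (so the result is in nats), and Q is assumed to be nonzero wherever P is nonzero.

```python
import math


def kl_divergence(p, q):
    """Discrete KL(P || Q) in nats.

    Assumes p and q are same-length sequences of probabilities and that
    q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two hypothetical distributions over the same two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)  # KL(P || Q)
reverse = kl_divergence(q, p)  # KL(Q || P)

print(f"KL(P || Q) = {forward:.4f}")
print(f"KL(Q || P) = {reverse:.4f}")
```

The two printed values differ, which shows that swapping the arguments changes the result, and `kl_divergence(p, p)` evaluates to zero.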

Clay
2024-10-10 (updated 2024-10-23)
Python

Locust is an open-source load testing tool that helps simulate heavy user traffic on web applications and APIs. Compared to traditional load testing tools, Locust offers more customization and scalability—it supports Python as the scripting language, allowing us to write tests specific to our API or web service use cases.

Clay
2024-10-08
AI, Machine Learning, PyTorch

A multi-modal large language model (MLLM) isn't limited to text only. I know a "language model" that handles more than text might sound contradictory, but the term has become widely accepted. What I want to document today is how to fine-tune a multi-modal model using a script.

Clay
2024-10-04
Python

Introduction

Recently, while handling some work-related matters, I noticed that the client might need a way to extract text from PPT files. I discussed this with the PM and my supervisor, and they mentioned that the client could simply copy the text from the PPT slides manually, unless the client explicitly asks us to extract it programmatically.

Clay
2024-10-03
Machine Learning, Python, Scikit Learn

The first time I heard about Vector Quantization (VQ) was from a friend who was working on audio processing, which gave me a vague understanding that VQ is a technique used for data feature compression and representation. At that time, I still wasn't clear on how it differed from dimensionality reduction techniques like PCA.

Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

[Python] FastAPI Using Server-Sent Events (SSE) for Streaming Responses

Using Finite State Machine (FSM) and Rollback Mechanism to Restrict LLM from Generating Banned Words

[Python] Array Bisection Algorithm bisect Note

Note Of Hydra: Environment Configure Manager Package

[Machine Learning] Note Of Kullback-Leibler Divergence

[Python] Using Locust Open Source Load Testing Framework for Stress Testing

Notes on Fine-Tuning a Multi-Modal Large Language Model Using SFTTrainer (Taking LLaVa-1.5 as an Example)

[Python] Extracting Text from PPT Using the python-pptx Library

[Machine Learning] Vector Quantization (VQ) Notes