Python

Supporting Hydra Speculative Decoding on TensorRT-LLM Python Session

Clay
2025-07-01
AI, Machine Learning, Python

Introduction

I’ve previously studied a range of speculative decoding acceleration techniques and implemented several of them in PyTorch, covering the model architecture, training, and inference scripts (fast-llm-inference). This time, of course, I have a new goal.

Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (Here are gemma-2-9b-it Experiment Results)

Clay
2024-11-19
AI, Machine Learning, Python, PyTorch

Over the past week, I dedicated some time to reproducing the Self-Speculative Decoding mechanism based on the ideas from the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, implementing the following modules:

A Decoder-only Transformer model with layer skipping (based on Llama and Gemma-2 architectures)
Adaptive Draft Exit Mechanism
Bayesian Optimization to discover the best layer-skipping strategy (optimizing draft model configurations)
Self-Speculative Decoding — achieving acceleration purely through the model itself

[Python] FastAPI Using Server-Sent Events (SSE) for Streaming Responses

Clay
2024-11-02
Python

I have recently set up numerous backend API servers for chatbots. Initially, I received the user's message and returned the entire LLM-generated reply to the frontend in one go. However, this approach did not provide a good user experience. I then switched to HTTP streaming, sending each generated token to the frontend as it was produced. Later, I found that some users’ devices received several chunks merged together (the so-called "sticky packet" problem), so I finally switched to using WebSocket.

Using Finite State Machine (FSM) and Rollback Mechanism to Restrict LLM from Generating Banned Words

Clay
2024-10-29
AI, Machine Learning

When implementing various services through LLMs, do you worry about uncontrolled language generation? Recently, at a critical juncture in wrapping up a project, I used tools like Outlines to constrain LLM decoding, which effectively controlled the model’s output to follow the desired patterns. However, a colleague posed a deep question: What if I want it not to generate specific words?

[Python] Array Bisection Algorithm bisect Note

Clay
2024-10-27
Python

bisect is a module in Python's standard library, primarily designed to maintain the order of a sorted list, allowing items to be inserted without the need to re-sort the entire list.
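
As a minimal sketch of that idea (the list contents and variable names here are illustrative, not taken from the original post), insort places a new item at its sorted position, and bisect_left reports where a value would go:

```python
import bisect

# An already-sorted list (illustrative data).
scores = [10, 20, 30, 40]

# insort finds the insertion point by binary search and inserts there,
# so the list stays sorted without calling sort() again.
bisect.insort(scores, 25)

# bisect_left returns the index at which a value would be inserted,
# placing it before any existing equal entries.
position = bisect.bisect_left(scores, 30)

print(scores)
print(position)
```

Note that the search itself is O(log n), but the insertion into a Python list still shifts elements, so insort is O(n) overall.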

Note Of Hydra: Environment Configure Manager Package

Clay
2024-10-26
Linux, Python

Introduction

Hydra is an open-source Python framework designed to simplify the development of research and other complex applications. Hydra dynamically creates a hierarchical configuration by composition and allows it to be overridden through config files and the command line.

[Machine Learning] Note Of Kullback-Leibler Divergence

Clay
2024-10-13
Machine Learning, Python

What is KL Divergence?

In machine learning, we often encounter the term KL Divergence (also known as Kullback-Leibler Divergence). KL Divergence is a measure of how much one probability distribution P differs from another distribution Q. It is not symmetric, so it is not a distance metric in the strict sense.
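
For discrete distributions the definition is D_KL(P || Q) = sum over i of P(i) * log(P(i) / Q(i)). The following is a small sketch of that formula, assuming both distributions are given as lists over the same outcomes and that Q(i) > 0 wherever P(i) > 0; the function name and the example values are illustrative only:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

d_pq = kl_divergence(p, q)  # divergence of Q from P
d_qp = kl_divergence(q, p)  # swapping the arguments gives a different value

print(d_pq, d_qp)
```

Comparing d_pq with d_qp shows that the order of the arguments matters, and the divergence of a distribution from itself is zero.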

[Python] Using Locust Open Source Load Testing Framework for Stress Testing

Clay
2024-10-10 (updated 2024-10-23)
Python

Locust is an open-source load testing tool that helps simulate heavy user traffic on web applications and APIs. Compared to traditional load testing tools, Locust offers more customization and scalability: test scenarios are written in Python, allowing us to write tests specific to our API or web service use cases.

Notes on Fine-Tuning a Multi-Modal Large Language Model Using SFTTrainer (Taking LLaVa-1.5 as an Example)

Clay
2024-10-08
AI, Machine Learning, PyTorch

A multi-modal large language model (MLLM) isn’t limited to text only. I know this might sound contradictory for something called a "language model", but the term has become widely accepted. What I want to document today is how to fine-tune a multi-modal model using a script.

[Python] Extracting Text from PPT Using the python-pptx Library

Clay
2024-10-04
Python

Introduction

Recently, while handling some work-related matters, I noticed that the client might need a way to extract text from PPT files. I discussed this with the PM and my supervisor, and they mentioned that the client could simply copy the text from the PPT slides manually unless the client explicitly asks us to extract it programmatically.