Python

Supporting Hydra Speculative Decoding in the TensorRT-LLM Python Session

Clay
2025-06-30 / 2025-07-01
AI, Machine Learning, Python

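Introduction

I have previously read about many different speculative decoding techniques for accelerating inference, and I have implemented several of these architectures in PyTorch, including scripts for the model architecture, training, and inference (fast-llm-inference). This time, naturally, there is a new goal.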

LeetCode: 1749. Maximum Absolute Sum of Any Subarray Solution Notes

Clay
2025-02-26
C++, LeetCode, Python

LeetCode: 1079. Letter Tile Possibilities Solution Notes

Clay
2025-02-17
LeetCode

Problem

You have n tiles, where each tile has one letter tiles[i] printed on it.

Return the number of possible non-empty sequences of letters you can make using the letters printed on those tiles.
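
One common way to approach this problem is backtracking over letter counts, so that duplicate letters never produce duplicate sequences. The sketch below is only an illustration under that assumption and is not necessarily the approach taken in the full post; the function name `num_tile_possibilities` is made up for this example.

```python
from collections import Counter


def num_tile_possibilities(tiles: str) -> int:
    """Count distinct non-empty sequences that can be built from the tiles."""
    counts = Counter(tiles)

    def dfs() -> int:
        total = 0
        for letter in counts:
            if counts[letter] == 0:
                continue
            # Use one copy of this letter as the next character.
            counts[letter] -= 1
            # The sequence built so far counts once, plus every longer extension.
            total += 1 + dfs()
            # Put the tile back before trying the next letter.
            counts[letter] += 1
        return total

    return dfs()
```

Iterating over distinct letters rather than over tile positions is what keeps two identical tiles from being counted as different sequences.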

LeetCode: 108. Convert Sorted Array to Binary Search Tree Solution Notes

Clay
2024-11-25
C++, LeetCode, Python

Problem

Given an integer array nums where the elements are sorted in ascending order, convert it to a height-balanced binary search tree.
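
A standard way to solve this is to pick the middle element as the root and recurse on the two halves, which keeps the subtree sizes within one of each other. The following is a minimal sketch of that idea, not necessarily the code from the full post; `TreeNode`, `inorder`, and `is_balanced` are helper names introduced here for illustration.

```python
from typing import List, Optional


class TreeNode:
    def __init__(self, val: int, left: "Optional[TreeNode]" = None,
                 right: "Optional[TreeNode]" = None) -> None:
        self.val = val
        self.left = left
        self.right = right


def sorted_array_to_bst(nums: List[int]) -> Optional[TreeNode]:
    """Build a height-balanced BST from an ascending array."""
    def build(lo: int, hi: int) -> Optional[TreeNode]:
        if lo > hi:
            return None
        mid = (lo + hi) // 2  # middle element becomes the subtree root
        return TreeNode(nums[mid], build(lo, mid - 1), build(mid + 1, hi))

    return build(0, len(nums) - 1)


def inorder(node: Optional[TreeNode]) -> List[int]:
    """In-order traversal; for a valid BST this returns the sorted values."""
    if node is None:
        return []
    return inorder(node.left) + [node.val] + inorder(node.right)


def is_balanced(node: Optional[TreeNode]) -> bool:
    """Check that every node's subtrees differ in height by at most one."""
    def height(n: Optional[TreeNode]) -> int:
        if n is None:
            return 0
        left, right = height(n.left), height(n.right)
        if left < 0 or right < 0 or abs(left - right) > 1:
            return -1  # -1 marks an unbalanced subtree
        return 1 + max(left, right)

    return height(node) >= 0
```

Several different trees can be valid answers for the same input, so the two helpers check the properties (sorted in-order traversal, balanced heights) rather than one specific shape.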

A Complete Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (with gemma-2-9b-it Experiment Results)

Clay
2024-11-17
AI, Machine Learning, Python, PyTorch

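Over the past week, I took some time to reproduce Self-Speculative Decoding, following the approach of the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding. The implementation includes the following modules:

A layer-skipping decoder-only Transformer model (mainly the Llama and Gemma-2 architectures)
An adaptive draft-exiting mechanism
Bayesian optimization to search for the best layer-skipping strategy (that is, which combination of skipped layers makes the best draft model)
Self-Speculative Decoding: acceleration achieved using only the model itself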
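
The draft-then-verify loop behind these modules can be illustrated with a toy greedy version. This is only a sketch under strong simplifying assumptions, not the implementation from the post or the paper: the "models" are plain Python callables (`full_next` and `draft_next` are hypothetical names), verification is done one token at a time instead of in a single parallel forward pass, and the draft length is fixed rather than adaptive. In the real setting, the draft would come from the same network with some layers skipped.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def greedy_decode(full_next: NextToken, prompt: List[int],
                  max_new_tokens: int) -> List[int]:
    """Baseline: ask the full model for every token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(full_next(tokens))
    return tokens


def self_speculative_decode(full_next: NextToken, draft_next: NextToken,
                            prompt: List[int], max_new_tokens: int,
                            draft_len: int = 4) -> List[int]:
    """Draft several tokens cheaply, then keep only what the full model confirms."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Draft stage: the cheap model proposes up to draft_len tokens.
        k = min(draft_len, target_len - len(tokens))
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verify stage: always append the full model's token, and stop at the
        # first position where the draft disagrees with it.
        for proposed in draft:
            expected = full_next(tokens)
            tokens.append(expected)
            if expected != proposed:
                break
    return tokens


def toy_full(tokens: List[int]) -> int:
    """Stand-in for the full model (arbitrary deterministic rule)."""
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    """Stand-in for the layer-skipped draft: agrees except at every third position."""
    guess = toy_full(tokens)
    return (guess + 1) % 7 if len(tokens) % 3 == 0 else guess
```

Because every appended token is the full model's own choice, the output matches plain greedy decoding exactly. Any speedup in a real system comes from verifying a whole draft in one forward pass, which this toy version does not model.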

[Python] Streaming Responses with Server-Sent Events (SSE) in FastAPI

Clay
2024-10-31 / 2024-11-01
Python

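Recently I have built many backend API servers for chatbots. At first, I returned the response only after receiving the user's message and finishing generation, displaying the LLM's entire reply on the frontend at once, but that made for a poor user experience. I then switched to HTTP streaming, sending each token to the frontend as soon as it was generated, but later found that on some users' devices the chunks would stick together (the "sticky packet" problem), so in the end I switched to WebSocket.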
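
To show why SSE framing helps with chunks arriving merged, here is a minimal standard-library sketch of the event format: each event is a set of `data:` lines terminated by a blank line, so a client can split events even when several arrive in one read. The helper names (`format_sse`, `token_stream`, `split_events`), the JSON payload shape, and the `[DONE]` end marker are assumptions made for this illustration and are not taken from the full post.

```python
import json
from typing import Iterable, Iterator, List, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one payload as a Server-Sent Events message."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload gets its own "data:" field.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line marks the end of the event.
    return "\n".join(lines) + "\n\n"


def token_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE message per generated token, then an end marker."""
    for token in tokens:
        yield format_sse(json.dumps({"token": token}))
    yield format_sse("[DONE]", event="end")  # arbitrary sentinel for this sketch


def split_events(buffer: str) -> List[str]:
    """Client side: recover individual events from a merged buffer."""
    return [chunk for chunk in buffer.split("\n\n") if chunk]
```

In FastAPI, a generator like `token_stream` could be returned from an endpoint with `StreamingResponse(token_stream(...), media_type="text/event-stream")`. The delimiter-based framing is what lets the frontend separate tokens that were coalesced in transit.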