November 2024

[Paper Reading] Fast Inference from Transformers via Speculative Decoding

Clay
2024-11-06
AI, Machine Learning, Papers

Last Updated on 2024-11-06 by Clay

Abstract

In auto-regressive decoding, generating K tokens requires K serial runs of the model, which is the current bottleneck in the inference time of large language models.
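To make the cost concrete, here is a minimal sketch of a plain decoding loop. It is only an illustration: `toy_next_token` is a hypothetical stand-in for one forward pass of a real model, and `greedy_decode` is a name chosen for this example, not something taken from the paper.

```python
def toy_next_token(prefix):
    """Hypothetical stand-in for one model forward pass: returns the next token id."""
    return (sum(prefix) + len(prefix)) % 50


def greedy_decode(prompt, k):
    """Generate k tokens one at a time; returns (new_tokens, forward_passes)."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(k):
        # Each step needs every previously generated token,
        # so the k calls cannot run in parallel.
        next_token = toy_next_token(tokens)
        forward_passes += 1
        tokens.append(next_token)
    return tokens[len(prompt):], forward_passes


generated, forward_passes = greedy_decode([1, 2, 3], k=5)
print(len(generated), forward_passes)
```

The number of forward passes always equals the number of generated tokens in this loop, which is the serial dependency that speculative decoding tries to relax.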

[Python] FastAPI Using Server-Sent Events (SSE) for Streaming Responses

Clay
2024-11-02
Python

Last Updated on 2024-11-02 by Clay

I have recently set up numerous backend API servers for chatbots. Initially, I received the user's message and returned the entire LLM-generated reply to the frontend in a single response. However, this approach did not provide a good user experience. I then switched to HTTP streaming, sending each generated token to the frontend as it was produced. Later, I found that on some users' devices several streamed chunks arrived merged into a single read (the "sticky packet" problem), so I finally switched to using WebSocket.

KV Cache: A Caching Mechanism To Accelerate Transformer Generation

Clay
2024-11-01
AI, Machine Learning

Last Updated on 2024-11-01 by Clay

During decoding in large language models, especially auto-regressive ones, tokens must be generated step by step until the entire sequence is complete. Within this process, caching techniques can reduce computation and improve decoding speed; one such technique is known as the KV Cache.
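The idea can be sketched with a toy single-head attention in plain Python. Everything here is an assumption made for illustration: `project` stands in for the learned key and value projections, the input vector doubles as the query, and the function names are hypothetical. The point is only that caching keys and values avoids recomputing them for earlier positions.

```python
import math


def project(x, scale):
    """Hypothetical stand-in for a learned linear projection (W_k or W_v)."""
    return [scale * v for v in x]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values)) for i in range(d)]


def decode_without_cache(inputs):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(inputs) + 1):
        keys = [project(x, 0.5) for x in inputs[:t]]
        values = [project(x, 2.0) for x in inputs[:t]]
        projections += 2 * t
        outputs.append(attend(inputs[t - 1], keys, values))
    return outputs, projections


def decode_with_cache(inputs):
    """Project only the newest position and append it to the cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for x in inputs:
        key_cache.append(project(x, 0.5))
        value_cache.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, key_cache, value_cache))
    return outputs, projections


inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_outputs, slow_projections = decode_without_cache(inputs)
fast_outputs, fast_projections = decode_with_cache(inputs)
print(slow_projections, fast_projections)
```

Both functions produce the same attention outputs, but the cached version performs a number of projections that grows linearly with sequence length instead of quadratically. The trade-off is the extra memory needed to hold the cached keys and values.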
