[Paper Reading] Fast Inference from Transformers via Speculative Decoding
Last Updated on 2024-11-06 by Clay
Abstract
In auto-regressive decoding, generating K tokens requires K sequential forward passes through the model, one per token. This sequential dependency is the main bottleneck in the inference time of large language models.
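As a minimal sketch of why this is a bottleneck, the loop below uses a hypothetical toy stand-in for a transformer forward pass (`toy_model` is not from the paper): each of the K generated tokens depends on all previous ones, so the K model calls cannot be parallelized.

```python
def toy_model(tokens):
    # Stand-in for a transformer forward pass: returns a "next token"
    # as a deterministic function of the current sequence.
    return (sum(tokens) + len(tokens)) % 50

def autoregressive_decode(prompt, k):
    tokens = list(prompt)
    for _ in range(k):  # K tokens -> K sequential forward passes
        next_token = toy_model(tokens)  # each call needs the previous output
        tokens.append(next_token)
    return tokens

print(autoregressive_decode([1, 2, 3], 4))  # prompt plus 4 generated tokens
```

Speculative decoding attacks exactly this loop: a cheap draft model proposes several tokens at once, and the large model verifies them in a single forward pass.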