[Paper Reading] Fast Inference from Transformers via Speculative Decoding

Last Updated on 2024-11-06 by Clay

Abstract

In auto-regressive decoding, generating K tokens requires K sequential passes through the model, and this sequential dependency is the main bottleneck in large language model inference. The paper reviewed here presents Speculative Decoding, an accelerated inference algorithm that leverages a small, fast draft model to propose several tokens at once, which the large target model then verifies in a single parallel pass while provably preserving the target model's output distribution.
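To make the idea concrete, here is a minimal sketch of one speculative decoding step following the accept/reject rule from the paper. The callables `target_probs_fn` and `draft_probs_fn` are hypothetical stand-ins for real model calls that return a next-token probability distribution; everything else (names, the `gamma` parameter) is illustrative, not the authors' reference implementation.

```python
import numpy as np

def speculative_decode_step(target_probs_fn, draft_probs_fn, prefix, gamma=4, rng=None):
    """One speculative decoding step, per Leviathan et al. (2023).

    target_probs_fn(seq) -> np.ndarray over the vocabulary (large target model)
    draft_probs_fn(seq)  -> np.ndarray over the vocabulary (small draft model)
    Both are hypothetical stubs standing in for real model forward passes.
    """
    rng = rng or np.random.default_rng()

    # 1. The draft model proposes gamma tokens auto-regressively.
    draft_tokens, draft_dists = [], []
    seq = list(prefix)
    for _ in range(gamma):
        q = draft_probs_fn(seq)
        tok = int(rng.choice(len(q), p=q))
        draft_tokens.append(tok)
        draft_dists.append(q)
        seq.append(tok)

    # 2. The target model scores all gamma + 1 positions. In a real system
    #    this is a SINGLE batched forward pass; the per-position calls here
    #    are only for clarity in the sketch.
    target_dists = [target_probs_fn(list(prefix) + draft_tokens[:i])
                    for i in range(gamma + 1)]

    # 3. Accept each draft token with probability min(1, p(x)/q(x)).
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q) and stop.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted

    # 4. All drafts accepted: sample one bonus token from the target model,
    #    so each step yields between 1 and gamma + 1 tokens.
    accepted.append(int(rng.choice(len(target_dists[gamma]), p=target_dists[gamma])))
    return accepted
```

Because the rejection rule resamples from the normalized residual distribution, the tokens produced are distributed exactly as if the target model had been sampled token by token; the speedup comes from accepting several draft tokens per target-model pass.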