[Paper Reading] Fast Inference from Transformers via Speculative Decoding
Last Updated on 2024-11-06 by Clay

Abstract

In auto-regressive decoding, generating K tokens requires K sequential passes through the model, which is the main bottleneck in large language model inference. The paper reviewed here presents Speculative Decoding, an accelerated inference algorithm that leverages …
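To make the idea concrete, here is a minimal greedy sketch of the draft-then-verify loop behind speculative decoding. The toy `draft_next` and `target_next` functions are assumptions for illustration only (real use pairs a small draft model with a large target model), and this simplified version uses exact-match acceptance rather than the paper's rejection-sampling scheme:

```python
# Toy sketch of speculative decoding over a 10-token vocabulary.
# draft_next / target_next are hypothetical stand-ins for real models.

def draft_next(ctx):
    # Cheap "draft" model: a simple deterministic rule (assumption).
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Expensive "target" model: agrees with the draft except after token 7.
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

def speculative_decode(prompt, gamma=4, steps=3):
    ctx = list(prompt)
    for _ in range(steps):
        # 1) The draft model proposes gamma tokens auto-regressively (cheap).
        proposed = []
        for _ in range(gamma):
            proposed.append(draft_next(ctx + proposed))
        # 2) The target model checks all gamma positions; in a real transformer
        #    this is ONE parallel forward pass (simulated here per prefix).
        accepted = []
        for i in range(gamma):
            t = target_next(ctx + proposed[:i])
            if t == proposed[i]:
                accepted.append(proposed[i])   # draft token verified
            else:
                accepted.append(t)             # take the correction and stop
                break
        else:
            # All gamma accepted: the same target pass yields one bonus token,
            # so one iteration can emit up to gamma + 1 tokens.
            accepted.append(target_next(ctx + proposed))
        ctx += accepted
    return ctx
```

When the draft and target models agree, each verification step emits up to gamma + 1 tokens for a single target-model pass, which is where the speedup comes from; a mismatch falls back to the target model's own token.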