[Paper Reading] Fast Inference from Transformers via Speculative Decoding
Last Updated on 2024-11-06 by Clay
Abstract
In auto-regressive decoding, generating K tokens requires K sequential forward passes through the model, one per token. This sequential dependency is the main bottleneck in the inference time of large language models.
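As a minimal sketch of why this is a bottleneck, the loop below uses a hypothetical toy stand-in for a transformer forward pass (`toy_model` is not from the paper): each of the K generated tokens depends on all previous ones, so the K model calls cannot be parallelized.

```python
def toy_model(tokens):
    # Stand-in for a transformer forward pass: returns a "next token"
    # as a deterministic function of the current sequence.
    return (sum(tokens) + len(tokens)) % 50

def autoregressive_decode(prompt, k):
    tokens = list(prompt)
    for _ in range(k):  # K tokens -> K sequential forward passes
        next_token = toy_model(tokens)  # each call needs the previous output
        tokens.append(next_token)
    return tokens

print(autoregressive_decode([1, 2, 3], 4))  # prompt plus 4 generated tokens
```

Speculative decoding attacks exactly this loop: a cheap draft model proposes several tokens at once, and the large model verifies them in a single forward pass.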