[Paper Reading] Fast Inference from Transformers via Speculative Decoding
Abstract
In autoregressive decoding, generating K tokens requires K sequential forward passes through the model, one per token; this sequential dependency is the main bottleneck in large language model inference.
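The sequential cost described above can be sketched as follows. This is a minimal illustration with a hypothetical toy model standing in for a transformer forward pass (`toy_model` and `autoregressive_decode` are not from the paper); it only shows that K generated tokens cost K model calls.

```python
# Minimal sketch: autoregressive decoding needs one full model call per
# generated token. `toy_model` is a hypothetical stand-in for a transformer
# forward pass, not the paper's model.

def toy_model(prefix):
    # Pretend "next token" is a deterministic function of the prefix.
    return len(prefix)

def autoregressive_decode(prefix, k):
    tokens = list(prefix)
    calls = 0
    for _ in range(k):
        next_token = toy_model(tokens)  # one sequential forward pass
        calls += 1
        tokens.append(next_token)       # each step depends on the last
    return tokens, calls

tokens, calls = autoregressive_decode([0], 5)
print(calls)  # K tokens -> K sequential model calls
```

Speculative decoding attacks exactly this loop: a cheap draft model proposes several tokens, and the large model verifies them in a single parallel pass.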