Speculative Decoding Implementation Note (with Simple Experimental Results)
Last Updated on 2024-11-09 by Clay
Introduction
Speculative decoding is a practical inference acceleration technique. A small model (the draft model) rapidly decodes several tokens in a row and keeps the probability distribution it produced at each step. The larger target model, the one we want to accelerate, then scores the whole draft in a single forward pass instead of generating token by token. At each token position, the draft model’s probability is checked against the target model’s probability for the same token, and the draft token is accepted if the target model agrees with it closely enough.
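To make the verification step concrete, here is a minimal sketch in plain Python. It assumes the acceptance rule commonly used in speculative sampling (accept a draft token with probability min(1, p/q), where q is the draft probability and p is the target probability, and on the first rejection resample from the normalized positive part of p - q). The intro above does not spell out the exact rule, so treat this as an assumption. The function name `verify_draft` and its arguments are hypothetical, and the distributions are plain lists rather than model outputs.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng=None):
    """Return the tokens kept after checking a draft against the target model.

    draft_tokens:  token ids proposed by the draft model, one per position
    draft_probs:   draft_probs[i] is the draft distribution q_i over the vocabulary
    target_probs:  target_probs[i] is the target distribution p_i over the vocabulary
    (All names here are illustrative, not taken from any library.)
    """
    rng = rng or random.Random()
    kept = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Assumed rule: accept with probability min(1, p(tok) / q(tok)).
        if q[tok] > 0:
            accept_prob = min(1.0, p[tok] / q[tok])
        else:
            accept_prob = 1.0 if p[tok] > 0 else 0.0

        if rng.random() < accept_prob:
            kept.append(tok)
            continue

        # First rejection: resample from the residual max(0, p - q),
        # then discard the rest of the draft.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        weights = residual if sum(residual) > 0 else p
        kept.append(rng.choices(range(len(p)), weights=weights, k=1)[0])
        break
    return kept


# When both models agree exactly, every draft token is accepted.
uniform = [0.5, 0.5]
print(verify_draft([0, 1], [uniform, uniform], [uniform, uniform]))
```

A full implementation would also sample one extra token from the target model when every draft token is accepted. That step is left out here to keep the sketch short.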