Speculative Decoding Implementation Note (with Simple Experimental Results)
Introduction
Speculative Decoding is an extremely practical inference acceleration technique that enables a small model (draft model) to rapidly decode multiple tokens and retain the probability distribution of this process. Then, the larger target model, which we aim to accelerate, predicts the next token based on this draft. For each token position, the draft model’s probability distributions are computed and validated using the target model's probabilities, accepting the tokens decoded by the draft model if they are deemed sufficiently reliable.
Read More »Speculative Decoding Implementation Note (with Simple Experimental Results)