Speculative Decoding Implementation Note (with Simple Experimental Results)

Last Updated on 2024-11-09 by Clay

Introduction

Speculative Decoding is a highly practical inference acceleration technique. A small model (the draft model) rapidly decodes multiple tokens and retains the probability distribution from each decoding step. The larger target model, which is the one we aim to accelerate, then predicts the next token based on this draft. For …
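To make the draft-then-verify flow described above concrete, here is a minimal sketch in Python. It is not the implementation from this note: `draft_probs`, `target_probs`, `VOCAB_SIZE`, and `gamma` are hypothetical names, and the two "models" are toy functions that return fixed distributions. The acceptance step uses the commonly cited speculative sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q), which is assumed here rather than taken from the excerpt.

```python
import random

VOCAB_SIZE = 4  # toy vocabulary size, purely illustrative


def draft_probs(context):
    """Hypothetical small model: cheap, slightly off distribution."""
    last = context[-1] if context else 0
    probs = [0.1] * VOCAB_SIZE
    probs[(last + 1) % VOCAB_SIZE] = 0.7
    return probs


def target_probs(context):
    """Hypothetical large model: the distribution we actually want."""
    last = context[-1] if context else 0
    probs = [0.05] * VOCAB_SIZE
    probs[(last + 1) % VOCAB_SIZE] = 0.85
    return probs


def sample(weights, rng):
    """Draw one token id; weights do not need to be normalized."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(context, gamma, rng):
    """Return between 1 and gamma + 1 new tokens for the given context."""
    # 1) The draft model proposes gamma tokens and keeps each distribution.
    drafted, draft_dists = [], []
    ctx = list(context)
    for _ in range(gamma):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2) The target model checks each drafted token in order.
    #    (A real system would score all positions in one batched pass.)
    accepted = []
    ctx = list(context)
    for tok, q in zip(drafted, draft_dists):
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejected: resample from the positive part of (p - q) and stop.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            accepted.append(sample(residual, rng))
            return accepted

    # 3) Every draft token was accepted: take one extra token from the target.
    accepted.append(sample(target_probs(ctx), rng))
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step([0], gamma=4, rng=rng))
```

Each call yields at least one token and at most `gamma + 1`, so the speedup depends on how often the draft distribution agrees with the target. In this sketch the target is called once per position for readability; the practical benefit comes from scoring all drafted positions in a single forward pass of the large model.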