Using The Target Model's Confidence Threshold To Decide Whether To Enable Speculative Decoding
Last Updated on 2024-11-22 by Clay
Many of the inference acceleration techniques I have studied, such as Speculative Decoding, mostly apply a threshold to the draft model's confidence scores. This threshold determines how many draft tokens are decoded before they are passed to the target model for verification, which reduces the extra computational cost incurred when the draft model operates with low confidence.
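To make the draft-side threshold concrete, here is a minimal sketch in plain Python. It assumes that "confidence" means the highest softmax probability at each step and that drafting stops before a low-confidence token is added; real implementations may define both differently. The names `draft_tokens`, `draft_logits_fn`, and `toy_draft_logits` are hypothetical and exist only for this illustration.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def draft_tokens(
    draft_logits_fn: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    threshold: float = 0.5,
    max_draft_tokens: int = 5,
) -> List[int]:
    """Greedily draft tokens until the draft model's confidence drops.

    Assumption: confidence is the top-1 softmax probability, and the
    first token that falls below `threshold` is NOT included in the draft.
    """
    drafted: List[int] = []
    context = list(prefix)
    for _ in range(max_draft_tokens):
        probs = softmax(draft_logits_fn(context))
        # Greedy choice: index of the most probable token.
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] < threshold:
            # Low confidence: stop drafting and hand over to the target model.
            break
        drafted.append(token)
        context.append(token)
    return drafted


def toy_draft_logits(context: List[int]) -> List[float]:
    """Hypothetical stand-in for a draft model over a 3-token vocabulary.

    It is confident while the context is short and uncertain afterwards.
    """
    if len(context) < 3:
        return [0.0, 5.0, 0.0]
    return [1.0, 1.0, 1.0]


if __name__ == "__main__":
    # Two confident steps, then the flat distribution ends the draft early.
    print(draft_tokens(toy_draft_logits, prefix=[0], threshold=0.5))
```

With this toy model the draft ends after two tokens instead of the maximum of five, so the target model verifies a shorter but more reliable draft. Lowering `threshold` lets the draft run longer at the risk of more rejected tokens.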