Skip to content

November 16, 2024

[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Last Updated on 2024-11-16 by Clay

Highlights of This Paper

  • Quantization, pruning, and distillation can also accelerate models, but come with issues like changes in output distribution compared to the original model, as well as the cost of retraining.
  • The original Speculative Decoding faces the issue of requiring additional memory to run the draft model, whereas Self-Speculative Decoding uses part of its own neural network as the draft model.
  • The Adaptive Draft-Exiting Mechanism can automatically adjust the number of tokens predicted by the draft model based on confidence score thresholds.
Read More »[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding