[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
Highlights of This Paper
- Quantization, pruning, and distillation can also accelerate models, but come with issues like changes in output distribution compared to the original model, as well as the cost of retraining.
- The original Speculative Decoding faces the issue of requiring additional memory to run the draft model, whereas Self-Speculative Decoding uses part of its own neural network as the draft model.
- The Adaptive Draft-Exiting Mechanism can automatically adjust the number of tokens predicted by the draft model based on confidence score thresholds.