Self-Speculative Decoding Implementation: LayerSkip Model, Bayesian Optimization, and Adaptive Draft-Exiting Mechanism (with gemma-2-9b-it Experiment Results)
Last Updated on 2024-11-19 by Clay
Over the past week, I dedicated some time to reproducing the Self-Speculative Decoding mechanism based on the ideas in the paper Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, implementing the following modules (a minimal sketch of each follows the list):
- A Decoder-only Transformer model with layer skipping (based on the Llama and Gemma-2 architectures)
- An Adaptive Draft-Exiting Mechanism
- Bayesian Optimization to find the best layer-skipping configuration for the draft model
- Self-Speculative Decoding itself, which achieves acceleration using only the model's own layers (no separate draft model)
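The layer-skipping model is the foundation: the draft pass reuses the verify model's own decoder layers but simply skips a chosen subset of them. Below is a minimal, self-contained sketch of that skip logic only, with a toy residual block standing in for the Llama/Gemma-2 decoder layers; it is not the actual implementation, just the shape of the idea.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stands in for a Llama/Gemma-2 decoder layer (attention + MLP)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(self.norm(x))  # pre-norm residual block


class SkippableStack(nn.Module):
    """A decoder stack whose forward pass can skip selected layers."""
    def __init__(self, num_layers: int = 8, dim: int = 64):
        super().__init__()
        self.layers = nn.ModuleList([ToyBlock(dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor, skip_layers=None) -> torch.Tensor:
        skip_layers = skip_layers or set()
        for idx, layer in enumerate(self.layers):
            if idx in skip_layers:
                continue  # draft pass: this layer is skipped entirely
            x = layer(x)
        return x


model = SkippableStack()
x = torch.randn(1, 16, 64)
verify_out = model(x)                     # verify pass: all layers
draft_out = model(x, skip_layers={2, 5})  # draft pass: cheaper approximation
```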
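The adaptive draft-exiting mechanism decides when to stop drafting: if the draft model's confidence in its next token drops below a threshold, control returns to the verify pass, and the threshold itself is nudged up or down depending on how many drafted tokens were actually accepted. The snippet below is a rough sketch under my own reading of the paper; the function names and the exact update rule are my simplification, not the paper's formulas.

```python
def should_exit_draft(token_prob: float, threshold: float) -> bool:
    """Stop drafting once the draft model is no longer confident enough."""
    return token_prob < threshold


def update_threshold(threshold: float, acceptance_rate: float,
                     target_rate: float = 0.9, step: float = 0.01,
                     low: float = 0.05, high: float = 0.95) -> float:
    """Raise the threshold (draft less) when too many draft tokens get rejected,
    lower it (draft more) when nearly everything is accepted."""
    if acceptance_rate < target_rate:
        threshold = min(high, threshold + step)
    else:
        threshold = max(low, threshold - step)
    return threshold
```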
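For the layer-skipping search, the paper uses Bayesian optimization over a binary skip decision per layer. The sketch below uses Optuna's default sampler purely as a stand-in search backend, and the objective is a synthetic placeholder so the script runs standalone; the real objective would time self-speculative decoding with the proposed skip set on a small prompt set.

```python
import optuna

NUM_LAYERS = 42  # decoder layer count of gemma-2-9b-it


def measure_seconds_per_token(skip_layers: set) -> float:
    """Placeholder metric so the sketch runs standalone: in the real setup,
    run self-speculative decoding with this skip set and measure latency."""
    return abs(len(skip_layers) - 16) * 0.01 + 0.1  # synthetic: moderate skipping wins


def objective(trial: optuna.Trial) -> float:
    # One binary decision per decoder layer: 1 means "skip in the draft pass".
    skip_layers = {i for i in range(NUM_LAYERS)
                   if trial.suggest_int(f"skip_{i}", 0, 1) == 1}
    return measure_seconds_per_token(skip_layers)


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```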
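Finally, the decoding loop itself: draft a few tokens with the cheap skip-layer pass (exiting early when confidence is low), verify them with one full-model pass, keep the longest agreeing prefix, and append one corrected token from the verify pass. The control flow below is a runnable toy sketch; the `draft_next_token` and `verify_tokens` functions are deterministic stand-ins I made up so the loop executes without a real model, and greedy exact-match acceptance is assumed.

```python
import random

VOCAB = 100


def verify_tokens(tokens, k):
    """Toy full model: a deterministic continuation of k + 1 greedy tokens."""
    ctx, out = list(tokens), []
    for _ in range(k + 1):
        nxt = (ctx[-1] * 7 + 3) % VOCAB
        out.append(nxt)
        ctx.append(nxt)
    return out


def draft_next_token(tokens):
    """Toy draft model: usually matches the full model, sometimes guesses wrong."""
    true_next = (tokens[-1] * 7 + 3) % VOCAB
    if random.random() < 0.8:
        return true_next, random.uniform(0.5, 1.0)          # confident and correct
    return random.randrange(VOCAB), random.uniform(0.1, 0.6)  # low-confidence guess


def self_speculative_generate(prompt_tokens, max_new_tokens=32,
                              max_draft_len=6, threshold=0.4):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # 1) Draft: cheap skip-layer pass with an early-exit confidence check.
        draft = []
        while len(draft) < max_draft_len:
            tok, conf = draft_next_token(tokens + draft)
            if conf < threshold:  # adaptive draft exit
                break
            draft.append(tok)
        # 2) Verify: one full-model pass covering all drafted positions.
        verified = verify_tokens(tokens, len(draft))
        # 3) Accept the longest agreeing prefix, then one corrected token.
        n_accept = 0
        while n_accept < len(draft) and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(draft[:n_accept])
        tokens.append(verified[n_accept])
    return tokens[: len(prompt_tokens) + max_new_tokens]


print(self_speculative_generate([1, 2, 3]))
```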