Last Updated on 2024-11-16 by Clay
Highlights of This Paper
- Quantization, pruning, and distillation can also accelerate models, but come with issues like changes in output distribution compared to the original model, as well as the cost of retraining.
- The original Speculative Decoding faces the issue of requiring additional memory to run the draft model, whereas Self-Speculative Decoding uses part of its own neural network as the draft model.
- The Adaptive Draft-Exiting Mechanism automatically adjusts how many tokens the draft model predicts by dynamically tuning a confidence-score threshold.
Abstract
Researchers have proposed a variation called Self-Speculative Decoding based on the original Speculative Decoding.
The original Speculative Decoding can be divided into two stages: drafting and verification.
During the drafting stage, a smaller draft model, which shares the same vocabulary as the target model we aim to accelerate, generates tokens. Since the draft model is much faster than the target model, it can quickly decode multiple tokens and provide a probability distribution for the predicted token at each sequence position.
In the verification stage, the target model runs a single forward pass over the drafted sequence to predict the next token. In doing so, it also produces logits for all the drafted positions, which can be converted into probability distributions and compared against the draft model's predictions for verification.
For more details on the verification process, you can refer to my previous notes: [Paper Review] Fast Inference from Transformers via Speculative Decoding
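To make the two stages concrete, here is a minimal sketch of one draft-then-verify round. The callables draft_step and target_forward are placeholders of my own (not the paper's API) that return probability distributions, and the acceptance check follows the standard speculative-sampling rule:

```python
import torch

def speculative_round(prefix, draft_step, target_forward, k=4):
    """One draft-then-verify round of (self-)speculative decoding."""
    seq = list(prefix)
    draft_tokens, draft_probs = [], []

    # Drafting stage: the fast model proposes k tokens autoregressively.
    for _ in range(k):
        probs = draft_step(seq)                      # (vocab_size,) distribution
        token = int(torch.multinomial(probs, 1))
        draft_tokens.append(token)
        draft_probs.append(probs)
        seq.append(token)

    # Verification stage: one target forward pass returns k + 1 distributions,
    # one for each drafted position plus one for the token after the draft.
    target_probs = target_forward(seq)

    accepted = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i][token], draft_probs[i][token]
        # Standard acceptance rule: accept the drafted token with probability min(1, p/q).
        if torch.rand(()).item() < min(1.0, (p / q).item()):
            accepted.append(token)
        else:
            # Rejected: resample from the residual distribution and drop the rest of the draft.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break
    else:
        # All drafted tokens accepted: take one bonus token from the target model.
        accepted.append(int(torch.multinomial(target_probs[-1], 1)))
    return accepted
```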
Self-Speculative Decoding, on the other hand, eliminates the need for an additional draft model, using part of its own network instead.
Self-Speculative Decoding Experiment Methodology
I have skipped over the background information provided in the paper, as it largely overlaps with the Speculative Decoding that I've reviewed previously. I will focus solely on the highlights of Self-Speculative Decoding.
Bayesian Optimization for Layer Skipping Strategy
Since we aim to skip certain layers during computation and use the modified model as the draft model, it is essential to carefully select which layers to skip.
Intuitively, if we skip no layers, the acceptance rate (AR) would be essentially 100%, but there would also be no speedup, since the draft model would be exactly as slow as the target. On the other hand, if we reduce the model to just one layer, the draft model would be incredibly fast, but the acceptance rate would likely drop to nearly 0%.
It’s worth mentioning that the layers being skipped can be either Attention Layers or MLP Layers, even though MLP Layers typically follow Attention Layers. These layers can be independently considered for skipping without the need to always bind them together.
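Conceptually, the draft forward pass just runs the target model's own decoder blocks while omitting the sublayers chosen to be skipped. Below is an illustrative sketch; the module layout (blocks with attn/mlp sublayers and pre-norms) is an assumption of mine for illustration, not the paper's actual code:

```python
import torch
import torch.nn as nn

def draft_forward(model: nn.Module, hidden: torch.Tensor,
                  skip_attn: set[int], skip_mlp: set[int]) -> torch.Tensor:
    """Reuse the target model's own layers as the draft model, skipping selected sublayers."""
    for i, block in enumerate(model.blocks):
        # Attention and MLP sublayers are skipped independently of each other.
        if i not in skip_attn:
            hidden = hidden + block.attn(block.attn_norm(hidden))
        if i not in skip_mlp:
            hidden = hidden + block.mlp(block.mlp_norm(hidden))
    return model.final_norm(hidden)
```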
The method used in the paper is Bayesian Optimization, a fitting choice here: each evaluation of a candidate skip configuration requires actually measuring inference speed, which is expensive, and Bayesian Optimization is designed to find a good setting of such a black-box objective with relatively few evaluations.
Here, the parameters are the combination of layers to skip, while the objective function is the average inference time per token, which we aim to minimize.
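As a rough sketch of what this search could look like (not the authors' implementation), here Optuna's default sampler stands in for the paper's Bayesian Optimization, and measure_time_per_token is a hypothetical, caller-supplied callable that benchmarks a candidate skip configuration on a calibration set:

```python
import optuna

def search_skip_config(measure_time_per_token, num_blocks: int = 40, n_trials: int = 200):
    """Search which attention/MLP sublayers to skip, minimizing time per generated token."""
    def objective(trial: optuna.Trial) -> float:
        # One binary decision per attention sublayer and per MLP sublayer
        # (40 blocks would correspond to a LLaMA-2-13B-sized model).
        skip_attn = {i for i in range(num_blocks)
                     if trial.suggest_categorical(f"attn_{i}", [False, True])}
        skip_mlp = {i for i in range(num_blocks)
                    if trial.suggest_categorical(f"mlp_{i}", [False, True])}
        # Hypothetical helper: runs self-speculative decoding with this configuration
        # and returns the average seconds per generated token.
        return measure_time_per_token(skip_attn, skip_mlp)

    study = optuna.create_study(direction="minimize")  # lower time per token is better
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```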
Adaptive Draft-Exiting Mechanism
In the verification mechanism of Speculative Decoding, if just one token from the sequence predicted by the draft model is rejected, all subsequent tokens generated by the draft model will also be discarded. This means that predicting too many tokens with the draft model may be counterproductive for acceleration.
A simple solution is to introduce a threshold γ during the draft model's decoding: if the confidence score of a predicted token falls below γ, we stop drafting and hand the sequence to the target model for verification.
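A minimal sketch of this draft-exiting check, assuming greedy drafting, a placeholder draft_step callable that returns the next-token distribution, and illustrative default values:

```python
import torch

def draft_until_unsure(prefix, draft_step, gamma: float = 0.6, max_draft: int = 12):
    """Keep drafting until the draft model's confidence drops below gamma."""
    seq, drafted = list(prefix), []
    for _ in range(max_draft):
        probs = draft_step(seq)                        # (vocab_size,) distribution
        confidence, token = torch.max(probs, dim=-1)   # greedy drafting for simplicity
        if confidence.item() < gamma:                  # not confident enough: exit early
            break
        drafted.append(int(token))
        seq.append(int(token))
    return drafted
```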
However, a static γ might not be ideal in practical applications, since the target model verifies the draft model's sequence based on its own output probability distribution. In some cases, both models may be uncertain, and a fixed threshold then cuts off the draft model's predictions too strictly.
Therefore, the Adaptive Draft-Exiting Mechanism adjusts the confidence threshold dynamically based on the acceptance rate (AR).
Parameter definitions are as follows:
- α: The desired (target) acceptance rate. If the actual acceptance rate falls below it, the confidence threshold for the draft model is increased, requiring higher confidence before drafting continues.
- ε: Step size used when updating γ.
- AR: The current acceptance rate, usually maintained as a running value that also reflects previous rounds.
- γ̃: The tentatively adjusted threshold (γ nudged up or down by ε), which is then smoothed with the other parameters to obtain the final threshold.
- β₁: Weight for the current acceptance rate, allowing a smooth transition from the previous acceptance rate to the new one.
- β₂: Weight for the confidence threshold, likewise smoothing the update to avoid abrupt changes.
Together, these parameters implement the adaptive draft-exiting mechanism; a short sketch of the update rule follows below.
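Here is my reading of the update rule as a short sketch; the default values are illustrative placeholders, not the paper's settings:

```python
def update_threshold(gamma: float, running_ar: float, round_ar: float,
                     alpha: float = 0.9, epsilon: float = 0.01,
                     beta1: float = 0.5, beta2: float = 0.9):
    """Return the smoothed acceptance rate and the new confidence threshold gamma."""
    # Fold the latest round's acceptance rate into a running estimate (weight beta1).
    running_ar = beta1 * running_ar + (1.0 - beta1) * round_ar
    # Direction of adjustment: if acceptance is below the target alpha, raise gamma
    # (draft fewer, more confident tokens); otherwise lower it to draft more.
    gamma_tilde = gamma + epsilon if running_ar < alpha else gamma - epsilon
    # Smooth the threshold itself (weight beta2) to avoid abrupt jumps, clamp to [0, 1].
    gamma = min(1.0, max(0.0, beta2 * gamma + (1.0 - beta2) * gamma_tilde))
    return running_ar, gamma
```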
Experimental Results
Main Acceleration Results
The experimental results are quite promising, aligning with our intuition: larger models can better tolerate the effects of layer skipping. For example, a 70B parameter model achieved the greatest acceleration, reaching up to 1.992x.
Less is More: Skipping Too Many Layers Can Backfire
Another figure clearly demonstrates that skipping more layers doesn't always yield better acceleration results. If the acceptance rate drops too low, it can even lead to slower inference times.
Adaptive Draft-Exiting Mechanism Experimental Results
More importantly, the experiments showed that the Adaptive Draft-Exiting Mechanism is indeed effective. Whether compared with the fixed-K prediction or a static confidence threshold, the adaptive mechanism consistently performed among the best.
Conclusion
To me, this paper is especially significant. Back when I first read about Speculative Decoding in early September 2023, I envisioned something similar to Self-Speculative Decoding. However, in my mind, I imagined using genetic algorithms to select which layers to skip, rather than Bayesian optimization.
Not long after, I saw the research on Self-Speculative Decoding come out. It felt as if my ideas had been acknowledged, which is a rare and precious feeling, especially since I am well aware of my own limitations.
It feels as though, after reading enough papers, even someone as inexperienced as I am has started to look at research from a perspective similar to that of many other researchers.
From a technical point of view, I also see great value in this study. Maintaining both a draft model and a target model can be quite cumbersome. If future models can seamlessly use parts of their own neural networks for draft predictions, wouldn’t that be incredibly convenient?
Maybe in the future, applying such a technique will become the standard, and not using it will be the exception.
I also wonder whether layer-level skipping is the ultimate answer. Could we instead apply pruning, sparsification, layer skipping, and weight sharing in various combinations to carve a draft model out of the original target model? As Michelangelo is said to have put it: "The statue is already in the stone; I just need to remove the unnecessary parts."
Anyway, these are just some thoughts. I will still focus on implementing Self-Speculative Decoding as soon as possible.
If you’re interested, feel free to follow my GitHub project on accelerated inference: https://github.com/ccs96307/fast-llm-inference
I will keep adding interesting implementations of accelerated inference techniques to this project, so feel free to discuss anytime!
References
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
- GitHub - dilab-zju/self-speculative-decoding