[Paper Reading] Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
Last Updated on 2024-07-25 by Clay
Introduction
Kangaroo is an acceleration framework proposed by Huawei Noah's Ark Lab. It replaces the small draft model used in the original speculative decoding with a shallow sub-network of the large model itself. An additionally trained adapter and the model's own decoding head then generate the speculative tokens, which the large model verifies. The remaining steps are much the same as in the original speculative decoding process.
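To make the flow concrete, here is a minimal toy sketch of the idea in pure Python. It is not the paper's implementation: the weights are random instead of trained, decoding is greedy, and the draft length is fixed for simplicity. All names (`EXIT_LAYER`, `ADAPTER`, `draft_token`, `full_token`, `self_speculative_decode`) are hypothetical and chosen only for illustration.

```python
import math
import random

VOCAB, DIM, N_LAYERS, EXIT_LAYER = 16, 8, 6, 2  # toy sizes (assumptions)
rng = random.Random(0)  # fixed seed so the toy model is deterministic


def rand_matrix(rows, cols, scale):
    return [[rng.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]


EMBED = rand_matrix(VOCAB, DIM, 1.0)
LAYERS = [rand_matrix(DIM, DIM, 0.3) for _ in range(N_LAYERS)]
ADAPTER = rand_matrix(DIM, DIM, 0.3)  # stands in for the extra-trained adapter
LM_HEAD = rand_matrix(VOCAB, DIM, 1.0)  # shared by the draft and the full model


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def apply_layer(w, h):
    # Toy residual layer: h + tanh(W h)
    return [x + math.tanh(y) for x, y in zip(h, matvec(w, h))]


def embed(tokens):
    # Toy context: last token plus half of the previous token's embedding
    last = EMBED[tokens[-1]]
    prev = EMBED[tokens[-2]] if len(tokens) > 1 else [0.0] * DIM
    return [a + 0.5 * b for a, b in zip(last, prev)]


def argmax_token(h):
    logits = matvec(LM_HEAD, h)
    return max(range(VOCAB), key=logits.__getitem__)


def shallow_hidden(tokens):
    # Early exit: run only the first EXIT_LAYER layers of the large model
    h = embed(tokens)
    for w in LAYERS[:EXIT_LAYER]:
        h = apply_layer(w, h)
    return h


def draft_token(tokens):
    # Shallow sub-network -> adapter -> the model's own decoding head
    return argmax_token(apply_layer(ADAPTER, shallow_hidden(tokens)))


def full_token(tokens):
    # Full model: continue from the shallow hidden state through the rest
    h = shallow_hidden(tokens)
    for w in LAYERS[EXIT_LAYER:]:
        h = apply_layer(w, h)
    return argmax_token(h)


def greedy_decode(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(full_token(tokens))
    return tokens


def self_speculative_decode(prompt, n_new, draft_len=4):
    tokens, target = list(prompt), len(prompt) + n_new
    while len(tokens) < target:
        # 1) Draft a few tokens cheaply with the shallow sub-network
        draft = []
        for _ in range(draft_len):
            draft.append(draft_token(tokens + draft))
        # 2) Verify with the full model. A real system would score all draft
        #    positions in one parallel forward pass; this loop is sequential.
        for tok in draft:
            if len(tokens) >= target:
                break
            expected = full_token(tokens)
            tokens.append(expected)  # always keep the full model's token
            if expected != tok:
                break  # first mismatch: discard the remaining draft tokens
    return tokens
```

Because every kept token is the full model's own greedy choice, `self_speculative_decode([1, 2], 12)` returns the same sequence as `greedy_decode([1, 2], 12)` in this toy setup, which is the sense in which the method is lossless. Any speedup in practice depends on how often the cheap draft agrees with the full model, something this sketch does not measure.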