[Paper Reading] Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Last Updated on 2024-07-25 by Clay

Introduction

Kangaroo is an acceleration framework proposed by Huawei Noah's Ark Lab. It replaces the small draft model used in the original speculative decoding with a shallow sub-network of the large model itself. It also uses an additionally trained adapter, together with the large model's own decoding head, to generate speculative tokens, which are then …
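To make the drafting idea above concrete, here is a minimal toy sketch in NumPy, not the authors' implementation. Everything in it is an assumption made for illustration: the sizes, the random weights, the residual tanh "layers" (which have no attention and look only at the last token), and the single linear `adapter` standing in for the trained adapter. It only shows how draft tokens can come from the first few layers plus an adapter while reusing the large model's own decoding head. It stops at drafting, because the excerpt is cut off before describing what happens to the drafted tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
VOCAB, DIM, N_LAYERS, SHALLOW = 50, 16, 8, 2

embed = rng.normal(size=(VOCAB, DIM))                       # toy token embeddings
layers = [rng.normal(scale=0.3, size=(DIM, DIM))            # toy "large model" layers
          for _ in range(N_LAYERS)]
adapter = rng.normal(scale=0.3, size=(DIM, DIM))            # hypothetical stand-in for the trained adapter
lm_head = rng.normal(size=(DIM, VOCAB))                     # the large model's own decoding head


def run_layers(h, start, stop):
    """Apply toy residual layers layers[start:stop] to hidden state h."""
    for w in layers[start:stop]:
        h = h + np.tanh(h @ w)
    return h


def full_next_token(token):
    """Greedy next token from the full toy model (all N_LAYERS layers)."""
    h = run_layers(embed[token], 0, N_LAYERS)
    return int(np.argmax(h @ lm_head))


def draft_next_token(token):
    """Greedy next token from the shallow sub-network + adapter + shared head."""
    h = run_layers(embed[token], 0, SHALLOW)   # only the first SHALLOW layers
    h = h + np.tanh(h @ adapter)               # adapter bridges shallow features to the head
    return int(np.argmax(h @ lm_head))         # same head as the full model


def draft_tokens(token, n):
    """Autoregressively draft n speculative tokens with the cheap path."""
    drafted = []
    for _ in range(n):
        token = draft_next_token(token)
        drafted.append(token)
    return drafted


if __name__ == "__main__":
    print("drafted:", draft_tokens(3, 4))
    print("full model's next token:", full_next_token(3))
```

The point of the design is that the draft path shares the embeddings, the first layers, and the decoding head with the large model, so the only extra parameters are those of the adapter.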