[Paper Reading] Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Last Updated on 2024-07-25 by Clay

Introduction

Kangaroo is an acceleration framework proposed by Huawei Noah's Ark Lab. It replaces the separate small model used in original speculative decoding with a shallow sub-network of the large model itself, combined with an additionally trained adapter and the model's own decoding head, to generate speculative tokens that are then verified by the large model. The subsequent operations closely follow the original speculative decoding process.


Background

To explain the acceleration method of Kangaroo, we first need to introduce the concept of speculative decoding.

Speculative decoding is a proven technique for accelerating model inference. The basic idea is to use a smaller, faster draft model (the drafter) to quickly generate candidate tokens, which are then verified by the large (target) model.

For example, the drafter might quickly generate the candidate tokens "Today", "is", "a", "nice", "good" one after another, and the large model would then verify all of them in a single forward pass:

  • "" -> "Today" (O)
  • "Today" -> "is" (O)
  • "Today is" -> "a" (O)
  • "Today is a" -> "nice" (O)
  • "Today is a nice" -> "day" (X): originally the drafter predicted "good"

These 5 verification steps are completed in parallel by the large model. If one of the drafter's tokens is rejected, it is replaced with the large model's own prediction, the remaining draft tokens are discarded, and the drafter resumes drafting from the corrected sequence. In this way, what would have taken the large model 5 sequential decoding steps is compressed into a single forward pass.
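To make the draft-then-verify flow concrete, below is a minimal greedy-acceptance sketch in PyTorch. Note that `draft_model` and `target_model` are hypothetical callables that map a 1-D token-id tensor to per-position logits; a real implementation would reuse KV caches and typically uses probabilistic (sampling-based) acceptance rather than exact top-1 matching.

```python
import torch

def speculative_step(prefix, draft_model, target_model, k=5):
    """One draft-then-verify step (greedy acceptance).

    `draft_model` / `target_model` are hypothetical callables that take
    a 1-D LongTensor of token ids and return logits of shape (seq, vocab).
    """
    # 1. The drafter autoregressively proposes k speculative tokens.
    tokens = prefix.clone()
    draft_tokens = []
    for _ in range(k):
        next_token = draft_model(tokens)[-1].argmax()
        draft_tokens.append(next_token)
        tokens = torch.cat([tokens, next_token.view(1)])

    # 2. The target model scores the prefix plus all k drafts in a
    #    single forward pass (this is the parallel verification step).
    target_logits = target_model(tokens)

    # 3. Accept draft tokens left to right until the first mismatch;
    #    the rejected token is replaced by the target's own prediction
    #    and the remaining drafts are discarded.
    accepted = []
    for i, draft in enumerate(draft_tokens):
        target_pred = target_logits[len(prefix) + i - 1].argmax()
        if target_pred == draft:
            accepted.append(draft)
        else:
            accepted.append(target_pred)
            break
    return torch.cat([prefix, torch.stack(accepted)])
```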

However, because the drafter must be kept in memory alongside the large model and verification decodes many positions in parallel, GPU VRAM usage is higher than decoding with the large model alone. To some extent, it is a strategy of trading space for time.

Having explained the concept of speculative decoding, we can now move on to introduce Kangaroo's method.


Kangaroo's Method

Consistent Token Acceptance Rate (CTAR)

The architecture proposed by Kangaroo is interesting, but before diving into it, let’s first discuss the new evaluation metric introduced by the researchers.

Typically, speculative decoding uses two metrics: wall-time speedup ratio and compression rate. However, the Kangaroo research team pointed out that these metrics do not reflect the token acceptance rate of the drafter model in different contexts.

\text{Compression Rate (CR)} = \frac{\text{Accepted Draft Tokens}}{\text{Total Tokens}}


The research team introduced a new evaluation metric called Consistent Token Acceptance Rate (CTAR), which measures the probability that all tokens predicted by the drafter model are accepted by the target model given a prefix and a subsequent window size.

\text{Consistent Token Acceptance Rate (CTAR)} = \frac{\text{Accepted Draft Windows}}{\text{Total Windows}}

Intuitively, the CTAR score is expected to decrease as the window size increases.
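Both metrics are easy to express in code. The sketch below is my own reading of the definitions in plain Python, using a made-up acceptance record; in particular, the sliding-window interpretation of CTAR is an assumption, and the paper's exact evaluation protocol may differ.

```python
def compression_rate(accept_flags):
    """CR: fraction of draft tokens accepted by the target model.
    `accept_flags` holds one boolean per draft token."""
    return sum(accept_flags) / len(accept_flags)

def consistent_token_acceptance_rate(accept_flags, window_size):
    """CTAR: fraction of windows in which EVERY draft token was
    accepted; larger windows are stricter, so the score drops as
    `window_size` grows."""
    windows = [
        accept_flags[i:i + window_size]
        for i in range(len(accept_flags) - window_size + 1)
    ]
    return sum(all(w) for w in windows) / len(windows)

# Made-up acceptance record: True = accepted by the target model.
flags = [True, True, False, True, True, True, False, True]
print(compression_rate(flags))                     # 0.75
print(consistent_token_acceptance_rate(flags, 2))  # ~0.43
print(consistent_token_acceptance_rate(flags, 4))  # 0.0 (stricter)
```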


Early Exiting Mechanism

In Kangaroo, the drafter is not a separate small model as in the original speculative decoding method. Instead, the shallow sub-network (the first few layers) of the target model serves as part of the drafter, and an additional adapter network together with the model's own decoding layer (LM Head) completes it. This significantly reduces the drafter's parameter count.

In practice, only the parameters of the adapter network are added.
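As a rough sketch of how such a drafter could be assembled, assuming the target model exposes `embed`, `layers`, and `lm_head` attributes (hypothetical names; real model classes differ). The `adapter` argument, sketched further below, is the only module contributing new trainable parameters.

```python
import torch.nn as nn

class KangarooStyleDrafter(nn.Module):
    """Sketch of a self-drafting model in the spirit of Kangaroo.
    Everything except `adapter` is shared with (and frozen from)
    the target model, so only the adapter adds parameters."""

    def __init__(self, target_model, adapter, num_shallow_layers=2):
        super().__init__()
        self.embed = target_model.embed                        # shared
        self.shallow = target_model.layers[:num_shallow_layers]
        self.adapter = adapter                                 # only new params
        self.lm_head = target_model.lm_head                    # shared

    def forward(self, input_ids):
        hidden = self.embed(input_ids)
        for layer in self.shallow:       # shallow sub-network (exit point)
            hidden = layer(hidden)
        hidden = self.adapter(hidden)    # bridge shallow features to the head
        return self.lm_head(hidden)      # the target model's own LM Head
```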

Next is the early exiting mechanism proposed by the Kangaroo architecture: when the drafter's confidence in its current token prediction falls below a certain threshold, it stops generating early and hands all speculative tokens produced so far to the target model for verification. This effectively delegates the difficult decoding steps, the ones the drafter is uncertain about, to the target model.

The idea is that when the drafter lacks confidence in a prediction, it is better to hand control back to the target model early: even if the drafter spent the time decoding, the token would most likely be rejected anyway.
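A sketch of the drafting loop with this confidence-based early exit follows; the `threshold` and `max_draft` values here are illustrative, not the paper's settings.

```python
import torch

def draft_with_early_exit(prefix, drafter, threshold=0.6, max_draft=8):
    """Draft tokens until the drafter's top-1 probability drops below
    `threshold`, then hand everything drafted so far to the target
    model for verification."""
    tokens = prefix.clone()
    for _ in range(max_draft):
        probs = torch.softmax(drafter(tokens)[-1], dim=-1)
        confidence, next_token = probs.max(dim=-1)
        if confidence < threshold:   # drafter is unsure: exit early
            break
        tokens = torch.cat([tokens, next_token.view(1)])
    return tokens[len(prefix):]      # the speculative tokens to verify
```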

The detailed data flow can be confirmed from the architecture diagram in the paper. It is worth noting that training a Kangaroo model only involves training the adapter network (usually consisting of just two normalization layers and a multi-head attention mechanism), making it a low-cost fine-tuning task.
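Based on that description, the adapter might look roughly like the following. This is a sketch under my own assumptions (LayerNorm as the normalization, no residual connection, causal attention mask omitted); the released code may differ in these details.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Sketch of the lightweight adapter: two normalization layers
    around a single multi-head attention block."""

    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.norm_in = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          batch_first=True)
        self.norm_out = nn.LayerNorm(hidden_size)

    def forward(self, hidden):            # hidden: (batch, seq, hidden_size)
        x = self.norm_in(hidden)
        attn_out, _ = self.attn(x, x, x)  # self-attention over the sequence
        return self.norm_out(attn_out)
```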


Experiment Results

In experiments, Kangaroo showed significant acceleration on Spec-Bench, achieving speedups of up to 1.68x while requiring 88.7% fewer additional parameters than Medusa (67M vs. 591M).


Conclusion

Since the source code for Kangaroo has been open-sourced on GitHub (please refer to the links below), I reviewed it at an early stage. However, as of today (2024-06-03), the training scripts have still not been released.

Of course, since the Kangaroo architecture is already well specified, it is entirely feasible to train it directly. Honestly, I am quite eager to test it myself, especially since my personal goal this year is to research inference acceleration techniques.

However, before that, I would like to complete the implementation of Medusa first. I had started on it, but had to put it aside temporarily due to my busy work schedule.

Inference acceleration techniques rely heavily on existing frameworks. Simply improving the architecture itself is not enough for practical application; an optimized and deeply integrated framework is essential.

Therefore, reading the source code of various accelerated inference frameworks is also on my task list.


References

  • Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting - https://arxiv.org/abs/2404.18911
  • GitHub - Equationliu/Kangaroo - https://github.com/Equationliu/Kangaroo

