Differences and Comparison Between KL Divergence and Cross Entropy
Introduction
Recently, while implementing the paper Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, I ran into a question: the paper uses Cross Entropy Loss to align the probability distribution of the draft model with that of the target model. Why not use KL Divergence instead?
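One way to frame the question is through the identity H(p, q) = H(p) + KL(p ∥ q): when the target distribution p is fixed, cross entropy and forward KL differ only by the target's entropy, a constant, so they produce identical gradients for the draft model. The sketch below checks this numerically in PyTorch; the random logits here are stand-ins for the two models' outputs, not Kangaroo's actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits: the draft model is trainable, the target model is frozen.
draft_logits = torch.randn(4, 32, requires_grad=True)  # (batch, vocab)
target_logits = torch.randn(4, 32)                     # treated as constant

target_probs = F.softmax(target_logits, dim=-1)
draft_log_probs = F.log_softmax(draft_logits, dim=-1)

# Cross entropy against the soft target distribution: H(p, q) = -sum p log q
cross_entropy = -(target_probs * draft_log_probs).sum(dim=-1).mean()

# Forward KL divergence: KL(p || q) = sum p (log p - log q) = H(p, q) - H(p)
kl_div = F.kl_div(draft_log_probs, target_probs, reduction="batchmean")

# Entropy of the target distribution, H(p); independent of the draft model.
target_entropy = -(target_probs * target_probs.log()).sum(dim=-1).mean()

# The two losses differ only by the constant H(p).
print(torch.allclose(cross_entropy, kl_div + target_entropy))  # True
```

Since H(p) contributes no gradient with respect to the draft parameters, minimizing cross entropy against soft targets is, for optimization purposes, the same objective as minimizing forward KL.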