Differences and Comparison Between KL Divergence and Cross Entropy

Last Updated on 2024-12-03 by Clay

Introduction

Recently, while implementing the paper Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, I ran into a question: the paper uses Cross Entropy Loss to align the probability distribution of the draft model with that of the target model. Why not use KL Divergence instead?
