

Differences and Comparison Between KL Divergence and Cross Entropy

Last Updated on 2024-12-03 by Clay

Introduction

Recently, while implementing the paper Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, I ran into a question: the paper uses Cross Entropy Loss to align the probability distribution of the draft model with that of the target model. Why not use KL Divergence instead?
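As a starting point for that question, recall that the two quantities are linked by the identity H(p, q) = H(p) + KL(p || q): the cross entropy equals the entropy of the target distribution plus the KL divergence from it. Below is a minimal PyTorch sketch that checks this identity numerically; draft_logits and target_logits are hypothetical stand-ins for the two models' outputs, not code from the Kangaroo paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits standing in for the draft and target models' outputs
# (shape: [batch_size, vocab_size]).
torch.manual_seed(0)
draft_logits = torch.randn(4, 100)
target_logits = torch.randn(4, 100)

target_probs = F.log_softmax(target_logits, dim=-1).exp()  # p: the "true" distribution
draft_log_probs = F.log_softmax(draft_logits, dim=-1)      # log q: the distribution to align

# Cross entropy: H(p, q) = -sum_x p(x) log q(x), averaged over the batch
cross_entropy = -(target_probs * draft_log_probs).sum(dim=-1).mean()

# KL divergence: KL(p || q) = sum_x p(x) (log p(x) - log q(x))
# F.kl_div expects log-probabilities as input and probabilities as target.
kl_div = F.kl_div(draft_log_probs, target_probs, reduction="batchmean")

# Entropy of the target distribution: H(p) = -sum_x p(x) log p(x)
target_entropy = -(target_probs * target_probs.log()).sum(dim=-1).mean()

# The identity H(p, q) = H(p) + KL(p || q) holds numerically.
print(torch.allclose(cross_entropy, target_entropy + kl_div, atol=1e-5))  # True
```

Since the target model is frozen during alignment, H(p) is a constant with respect to the draft model's parameters, so minimizing cross entropy and minimizing KL divergence produce the same gradients and share the same minimizer.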
