Direct Preference Optimization (DPO) Training Note
Introduction
DPO (Direct Preference Optimization) is a fine-tuning method that aims to replace RLHF (Reinforcement Learning from Human Feedback).