January 2, 2024

Direct Preference Optimization (DPO) Training Note

Last Updated on 2024-01-02 by Clay

DPO (Direct Preference Optimization) is a fine-tuning method that aims to replace RLHF (Reinforcement Learning from Human Feedback).