Direct Preference Optimization (DPO) Training Note

Last Updated on 2024-01-02 by Clay

Introduction

DPO (Direct Preference Optimization) is a fine-tuning method intended to replace RLHF (Reinforcement Learning from Human Feedback). It is well known that large language models acquire broad knowledge and understanding capabilities through unsupervised pre-training (some researchers argue this merely compresses and stores the information in the weights); …
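As a rough orientation before the note is cut off: DPO trains directly on preference pairs with a simple classification-style loss instead of an RL loop. The sketch below shows the standard DPO objective computed from sequence log-probabilities; the function name and argument layout are my own illustration, not code from this note.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is a 1-D tensor of summed token log-probabilities,
    under the trainable policy or the frozen reference model.
    beta scales the implicit KL penalty against the reference.
    """
    # Log-ratios of policy vs. reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Reward margin between chosen and rejected responses
    logits = beta * (chosen_logratios - rejected_logratios)
    # Maximize the probability that the chosen response is preferred
    return -F.logsigmoid(logits).mean()
```

No reward model or sampling loop is needed: the reference model's log-probabilities are precomputed once, and the policy is updated by ordinary gradient descent on this loss.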