Last Updated on 2024-08-02 by Clay
Introduction
When fine-tuning Large Language Models (LLMs), methods such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) are all viable approaches. However, there are some notable differences among them.
In the classic PPO training method, LLM training is divided into the following steps:
- Unsupervised (self-supervised) pre-training: the model acquires broad knowledge by predicting masked tokens or continuing sentences over large amounts of text, learning to absorb diverse information from raw text
- Supervised Fine-Tuning: provide training data with reference answers for the domains we want the model to learn; for LLMs, this is the step that develops their "conversational" ability
- RLHF: once the LLM can converse fluently, we train a Reward Model (RM) to teach it human values, that is, what should and should not be said, and use the RM to guide the LLM toward responses that align with human preferences
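To make the RLHF step a bit more concrete, below is a minimal sketch of a single PPO update using Hugging Face's trl library. The API shown follows older trl releases (roughly 0.7 to 0.11, where PPOTrainer takes a config, a policy model, a reference model, and a tokenizer); newer releases reworked this interface. The base model "gpt2", the query text, and the hard-coded reward of 1.0 are placeholders of mine; in real RLHF the score would come from the trained Reward Model.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Policy model (with a value head for PPO) and a frozen reference copy.
model_name = "gpt2"  # placeholder base model
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

# One toy rollout: generate a response to a single query.
query_tensor = tokenizer.encode("How should I greet a customer?", return_tensors="pt")
response_tensor = ppo_trainer.generate(
    list(query_tensor),
    return_prompt=False,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)

# Score the response. Here it is a dummy constant; in real RLHF this number
# comes from the Reward Model trained on human preference data.
reward = [torch.tensor(1.0)]

# One PPO optimization step nudges the policy toward higher-reward responses
# while keeping it close to the reference model.
stats = ppo_trainer.step([q for q in query_tensor], response_tensor, reward)
```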
DPO is a training method that emerged to replace RLHF. For details, you can refer to my previous article "Notes on Direct Preference Optimization (DPO) Training Method".
However, in my daily work, I encountered a project that required fine-tuning a model, and the task for the LLM in this project was "reading comprehension and responding to users" — in simple terms, performing RAG. Regarding RAG, I have a previous article discussing Self-RAG: "[Paper Review] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection".
But after preparing the training data, I became somewhat confused: Should I use SFT to train the model? Or should I use DPO?
Differences between SFT and DPO
To understand this issue, I began to think about the purposes and characteristics of SFT and DPO.
SFT:
- SFT explicitly trains the LLM on annotated datasets, creating clear correspondences between specific inputs and outputs
- SFT is used to improve the model's ability to follow instructions and increase accuracy
- SFT has a lower training cost compared to DPO (DPO requires an additional reference model)
- SFT enhances the model's understanding and generation capabilities in specific domains
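As a rough illustration of what an SFT run looks like in code, here is a minimal sketch using trl's SFTTrainer. The file name sft_train.jsonl, the "text" column, and the base model facebook/opt-350m are placeholder assumptions of mine, and the exact constructor arguments (dataset_text_field and max_seq_length directly vs. an SFTConfig object) vary between trl versions.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Hypothetical JSONL file in which every record has a "text" field holding
# the fully formatted prompt plus the reference answer.
dataset = load_dataset("json", data_files="sft_train.jsonl", split="train")

trainer = SFTTrainer(
    model="facebook/opt-350m",     # placeholder base model
    train_dataset=dataset,
    dataset_text_field="text",     # column containing the full training text
    max_seq_length=1024,
    args=TrainingArguments(
        output_dir="./sft-output",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
)
trainer.train()
```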
DPO:
- DPO uses a frozen copy of the model (typically the SFT checkpoint) as a reference model and optimizes the response policy directly on preference data, comparing human-preferred responses against rejected ones instead of training a separate reward model
- DPO adjusts model responses based on human preferences, making them more aligned with standards and expectations
- DPO requires more computational resources than SFT, since the reference model must also be loaded during training
- DPO can help LLMs maintain specific response styles or generate content adhering to specific ethical standards
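For comparison, a minimal DPO run with trl's DPOTrainer could look like the sketch below. The preference dataset must contain prompt, chosen, and rejected columns; the file name dpo_train.jsonl and the base model are placeholders of mine, and argument names (beta and tokenizer passed directly vs. a DPOConfig object) again differ between trl versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "facebook/opt-350m"   # placeholder; in practice, usually the SFT-tuned checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical preference data with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("json", data_files="dpo_train.jsonl", split="train")

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,                      # how strongly the policy is kept close to the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="./dpo-output",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        remove_unused_columns=False,  # keep the chosen/rejected columns for the DPO collator
    ),
)
trainer.train()
```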
Thinking about it this way, the picture becomes clearer: if I want to improve the LLM's response quality on the task itself, I should use SFT; if I need the model to adhere to specific values or preferences, I should use DPO (for instance, when a customer asks the LLM whether Taiwan is a country, models deployed in Taiwan must answer YES).
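One practical way to see the difference is in the training data each method expects: SFT wants a single gold answer per prompt, while DPO wants a preferred and a rejected answer per prompt. A hypothetical record of each (for the RAG-style task above) might look like this:

```python
# One SFT example: the prompt and the single reference answer the model should imitate.
sft_example = {
    "prompt": "According to the retrieved passage, in what year was the company founded?",
    "response": "The passage states that the company was founded in 1998.",
}

# One DPO example: the same prompt paired with a preferred and a rejected answer.
dpo_example = {
    "prompt": "According to the retrieved passage, in what year was the company founded?",
    "chosen": "The passage states that the company was founded in 1998.",
    "rejected": "I'm not sure, but it was probably sometime in the 2000s.",
}
```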
References
- Understanding and Using Supervised Fine-Tuning (SFT) for Language Models
- Direct Preference Optimization (DPO): A Simplified Approach to Fine-tuning Large Language Models
- Fine-Tuning LLMs: Supervised Fine-Tuning and Reward Modelling