LLM Fine-tuning Note - Differences Between SFT and DPO

Last Updated on 2024-08-02 by Clay

Introduction

In fine-tuning tasks for Large Language Models (LLMs), methods such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) are all viable approaches, but they differ in important ways.

In the classic RLHF pipeline (trained with PPO), LLM training is divided into the following steps:

  1. Unsupervised learning (pre-training): let the LLM acquire broad knowledge by predicting masked tokens or continuing text, so that the model learns to comprehend diverse information from raw text
  2. Supervised Fine-tuning: provide training data with reference answers for the domains we want the model to learn; for LLMs, this is the step that develops their "conversational" ability
  3. RLHF: once the LLM can converse fluently, train a Reward Model (RM) to teach the model human values (what should and should not be said) and use it to guide the LLM toward responses that align with human preferences

DPO is a training method that emerged to replace RLHF. For details, you can refer to my previous article "Notes on Direct Preference Optimization (DPO) Training Method".

However, at work I recently encountered a project that required fine-tuning a model, and the LLM's task in this project was "reading comprehension and responding to users", which in simple terms means performing RAG. Regarding RAG, I have a previous article discussing Self-RAG: "[Paper Review] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection".

But after preparing the training data, I became somewhat confused: Should I use SFT to train the model? Or should I use DPO?


Differences between SFT and DPO

To understand this issue, I began to think about the purposes and characteristics of SFT and DPO.

SFT:

  • SFT explicitly trains the LLM on annotated datasets, building a direct mapping from specific inputs to desired outputs (see the minimal sketch after this list)
  • SFT is used to improve the model's ability to follow instructions and increase accuracy
  • SFT has a lower training cost compared to DPO (DPO requires an additional reference model)
  • SFT enhances the model's understanding and generation capabilities in specific domains
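
To make this concrete, below is a minimal sketch of one SFT gradient step, assuming a Hugging Face causal LM; the model name, prompt, and answer are placeholders I made up for illustration, not data from the actual project.

```python
# Minimal SFT sketch: cross-entropy on the answer tokens only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is RAG?\nAnswer:"
answer = " Retrieval-Augmented Generation combines retrieval with generation."

# Tokenize prompt and answer separately so we can mask the prompt tokens.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
answer_ids = tokenizer(answer, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, answer_ids], dim=-1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[-1]] = -100  # ignore the loss on the prompt

# One gradient step: the model is trained to reproduce the reference answer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
```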


DPO:

  • DPO keeps a frozen copy of the base (usually SFT-ed) model as a reference model and optimizes the policy directly on pairs of preferred and rejected responses, instead of training a separate reward model (a sketch of the loss follows this list)
  • DPO adjusts model responses based on human preferences, making them better aligned with human standards and expectations
  • DPO requires more computational resources than SFT, since the reference model must be loaded alongside the policy
  • DPO can help LLMs maintain a specific response style or generate content that adheres to specific ethical standards
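
For comparison, here is a minimal sketch of the DPO objective in plain PyTorch. The per-sequence log-probabilities below are dummy values; in a real pipeline they would be computed from the policy and the frozen reference model.

```python
# DPO loss sketch: prefer the chosen response over the rejected one,
# measured relative to the reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are the policy/reference log-ratios, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the reward margin; minimized when chosen >> rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -10.1]),
    policy_rejected_logps=torch.tensor([-11.8, -10.9]),
    ref_chosen_logps=torch.tensor([-12.5, -10.4]),
    ref_rejected_logps=torch.tensor([-11.5, -10.7]),
)
print(loss)
```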


Thinking about it this way, the picture becomes clearer: if I want to improve the LLM's response quality on the task itself, I should use SFT; if I need the model to adhere to specific values, I should use DPO (for instance, when a customer asks the LLM whether Taiwan is a country, models deployed in the Taiwan region must answer YES).
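
The shape of the training data reflects the same split: SFT only needs (prompt, response) pairs, while DPO needs (prompt, chosen, rejected) triples. The field names below are illustrative placeholders, not tied to any specific library.

```python
# Illustrative data formats for the two approaches.
sft_example = {
    "prompt": "Based on the retrieved context, answer the user's question.",
    "response": "According to the document, the warranty period is two years.",
}

dpo_example = {
    "prompt": "Based on the retrieved context, answer the user's question.",
    "chosen": "According to the document, the warranty period is two years.",
    "rejected": "I think the warranty is probably around five years.",
}
```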

