LLM Fine-tuning Note - Differences Between SFT and DPO

Last Updated on 2024-08-02 by Clay

Introduction

In fine-tuning tasks for Large Language Models (LLMs), methods such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) are all viable approaches, but they differ in important ways.

In the classic RLHF pipeline (trained with PPO), LLM training is divided into the following steps:

  1. Unsupervised learning (pre-training): let the LLM acquire broad knowledge by predicting masked tokens or continuing text, so that the model learns to comprehend diverse information from raw text
  2. Supervised Fine-tuning: provide training data with reference answers for the domains we want the model to learn; for LLMs, this is the step that develops their "conversational" ability
  3. RLHF: once the LLM can converse fluently, train a Reward Model (RM) to teach the model human values (what should and should not be said) and use it to guide the LLM toward responses that align with human preferences

DPO is a training method that emerged to replace RLHF. For details, you can refer to my previous article "Notes on Direct Preference Optimization (DPO) Training Method".

However, at work I recently encountered a project that required fine-tuning a model, and the LLM's task in this project was "reading comprehension and responding to users", which in simple terms means performing RAG. Regarding RAG, I have a previous article discussing Self-RAG: "[Paper Review] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection".

But after preparing the training data, I became somewhat confused: Should I use SFT to train the model? Or should I use DPO?


Differences between SFT and DPO

To understand this issue, I began to think about the purposes and characteristics of SFT and DPO.

SFT:

  • SFT explicitly trains the LLM on annotated datasets, building a direct mapping from specific inputs to desired outputs (see the minimal sketch after this list)
  • SFT is used to improve the model's ability to follow instructions and increase accuracy
  • SFT has a lower training cost compared to DPO (DPO requires an additional reference model)
  • SFT enhances the model's understanding and generation capabilities in specific domains
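
To make this concrete, below is a minimal sketch of one SFT gradient step, assuming a Hugging Face causal LM; the model name, prompt, and answer are placeholders I made up for illustration, not data from the actual project.

```python
# Minimal SFT sketch: cross-entropy on the answer tokens only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is RAG?\nAnswer:"
answer = " Retrieval-Augmented Generation combines retrieval with generation."

# Tokenize prompt and answer separately so we can mask the prompt tokens.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
answer_ids = tokenizer(answer, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, answer_ids], dim=-1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[-1]] = -100  # ignore the loss on the prompt

# One gradient step: the model is trained to reproduce the reference answer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
```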


DPO:

  • DPO keeps a frozen copy of the base (usually SFT-ed) model as a reference model and optimizes the policy directly on pairs of preferred and rejected responses, instead of training a separate reward model (a sketch of the loss follows this list)
  • DPO adjusts model responses based on human preferences, making them better aligned with human standards and expectations
  • DPO requires more computational resources than SFT, since the reference model must be loaded alongside the policy
  • DPO can help LLMs maintain a specific response style or generate content that adheres to specific ethical standards
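
For comparison, here is a minimal sketch of the DPO objective in plain PyTorch. The per-sequence log-probabilities below are dummy values; in a real pipeline they would be computed from the policy and the frozen reference model.

```python
# DPO loss sketch: prefer the chosen response over the rejected one,
# measured relative to the reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are the policy/reference log-ratios, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the reward margin; minimized when chosen >> rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -10.1]),
    policy_rejected_logps=torch.tensor([-11.8, -10.9]),
    ref_chosen_logps=torch.tensor([-12.5, -10.4]),
    ref_rejected_logps=torch.tensor([-11.5, -10.7]),
)
print(loss)
```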


Thinking about it this way, the picture becomes clearer: if I want to improve the LLM's response quality on the task itself, I should use SFT; if I need the model to adhere to specific values, I should use DPO (for instance, when a customer asks the LLM whether Taiwan is a country, models deployed in the Taiwan region must answer YES).
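
The shape of the training data reflects the same split: SFT only needs (prompt, response) pairs, while DPO needs (prompt, chosen, rejected) triples. The field names below are illustrative placeholders, not tied to any specific library.

```python
# Illustrative data formats for the two approaches.
sft_example = {
    "prompt": "Based on the retrieved context, answer the user's question.",
    "response": "According to the document, the warranty period is two years.",
}

dpo_example = {
    "prompt": "Based on the retrieved context, answer the user's question.",
    "chosen": "According to the document, the warranty period is two years.",
    "rejected": "I think the warranty is probably around five years.",
}
```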

