使用 SFTTrainer 微調多模態大型語言模型筆記（以 LLaVa-1.5 為例）

Last Updated on 2024-10-07 by Clay

多模態大型語言模型（Multi-Modal Large Language Model）是一種不侷限於文字的語言模型，我知道這聽起來很衝突，不過這算是目前大家普遍接受的一種稱呼；而我今天想要紀錄的，就是該怎麼使用一個腳本就進行多模態模型的微調。

目前我測試下來，最簡單的方式依然還是使用 HuggingFace 所開發的 TRL 框架中的 SFTTrainer()。畢竟最基本的多模態模型，其實就是能額外輸入『圖像資訊』讓語言模型生成文字；也就是說，只要我們能處理好輸入圖像的映射，我們後面的語言模型與 cross entropy 損失函數通通都一模一樣。

我以為有寫過一篇只微調單純語言模型的筆記：Supervised Fine-tuning Trainer (SFTTrainer) 訓練筆記，現在這篇筆記可以視為當初那篇筆記的擴充，是一個簡易的可以訓練多模態模型的腳本介紹。

而真要我說微調多模態模型的應用場景，我目前最感興趣的是做表格圖片的 parsing，最好是能直接產生可以使用的 Markdown 語法，並且上面有模型判斷標註好的 column、row 和 value，這樣我就可以處理多種不同格式的表格圖片了（不過，目前還在累積訓練資料的階段）。

資料格式

首先，我們先來確定訓練資料應該長什麼樣子吧！這裡是讀取 Huggingface 微調 LLaVa 的資料集進來看。

from datasets import load_dataset

dataset_name = "HuggingFaceH4/llava-instruct-mix-vsft"
dataset = load_dataset(dataset_name)

print(dataset)

Output:

DatasetDict({
    train: Dataset({
        features: ['messages', 'images'],
        num_rows: 259155
    })
    test: Dataset({
        features: ['messages', 'images'],
        num_rows: 13640
    })
})

我們可以看到，每一筆資料都有 messages 和 images 兩個欄位，messages 的部份看起來跟單純只訓練文字的語言模型的資料一模一樣：

[{'content': [{'index': None,
    'text': 'Who wrote this book?\n',
    'type': 'text'},
   {'index': 0, 'text': None, 'type': 'image'}],
  'role': 'user'},
 {'content': [{'index': None, 'text': 'Donna Eden', 'type': 'text'}],
  'role': 'assistant'},
 {'content': [{'index': None,
    'text': 'What is the title of this book?',
    'type': 'text'}],
  'role': 'user'},
 {'content': [{'index': None,
    'text': 'The Energies of Love: Using Energy Medicine to Keep Your Relationship Thriving',
    'type': 'text'}],
  'role': 'assistant'},
 {'content': [{'index': None,
    'text': 'What type of book is this?',
    'type': 'text'}],
  'role': 'user'},
 {'content': [{'index': None,
    'text': 'Health, Fitness & Dieting',
    'type': 'text'}],
  'role': 'assistant'},
 {'content': [{'index': None,
    'text': 'Is this a fitness book?',
    'type': 'text'}],
  'role': 'user'},
 {'content': [{'index': None, 'text': 'Yes', 'type': 'text'}],
  'role': 'assistant'}]

而在 {'index': 0, 'text': None, 'type': 'image'} 則標示著這裡是放置圖片的地方，並且有圖片的索引（畢竟圖片可能不只一張）。

而在 images 欄位的內，則是一個陣列，陣列中直接儲存著 PIL 格式的圖片。

也就是說準備好這樣的資料，就可以訓練一個圖像 + 文字的多模態語言模型。

訓練腳本

以下我分段敘述我的腳本構成（基本上也是從 HuggingFace 微調的腳本修改過來的）。

首先，匯入所有我會需要使用到的套件。

# When training LLaVa-1.5, we have to use:
# ```
# NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 python3 sft_trainer_vlm.py (Recommend)
# ```
#
# or
#
# ```
# accelerate launch sft_trainer_vlm.py
# ```

import torch
from datasets import load_dataset

from peft import LoraConfig
from transformers import AutoModelForVision2Seq, AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

接著，則是把我訓練時會使用到的參數都設定好：

# Settings
dataset_name = "HuggingFaceH4/llava-instruct-mix-vsft"
model_name_or_path = "models/llava-hf--llava-1.5-7b-hf/"
output_dir = "checkpoints/any_chatbot_20241007_llava_1.5/"

sft_config = SFTConfig(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=6e-6,
    lr_scheduler_type="cosine",
    max_steps=10000,
    evaluation_strategy="steps",
    save_strategy="steps",
    do_eval=True,
    eval_steps=100,
    save_steps=100,
    logging_steps=1,
    output_dir=output_dir,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    remove_unused_columns=False,
    bf16=False,
    fp16=True,
    report_to="none",
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    save_only_model=True,
    neftune_noise_alpha=5,
    dataset_kwargs={"skip_prepare_dataset": True}  # Must to set
)

以下是 SFTConfig 內所有參數的逐一解釋：

per_device_train_batch_size=1：每個設備（如 GPU）上的訓練 batch 大小；這裡設定為 1，表示每次更新的樣本數量為 1
per_device_eval_batch_size=1：每個設備上的評估 batch 大小；這裡同樣設置為 1，表示評估階段每次處理 1 個樣本
gradient_accumulation_steps=4：梯度累積的步數。當 batch size 受限於硬件內存時，這個參數允許我們在累積 4 個 batch 的梯度後再進行一次模型權重的更新，相當於增加了有效的批次大小
gradient_checkpointing=True：啟用梯度檢查點，以減少內存使用；這會在前向傳播時針對某些層進行存儲節省，在需要時重新計算這些層的梯度
learning_rate=6e-6：訓練的初始學習率，這裡設置為 6e-6。這個值決定了模型參數更新的步伐
lr_scheduler_type="cosine"：學習率調度器的類型。使用 cosine 曲線來隨訓練時間調整學習率，讓模型不會卡死在某個收斂點、但也可能發生意外導致收斂不了，選擇 cosine 時需要小心，我是根據實驗結果這樣選擇的
max_steps=10000：訓練的最大步數，模型將在進行 10000 步之後結束訓練
evaluation_strategy="steps"：設定何時進行模型評估，使用 steps 表示每隔一段步數進行一次評估
save_strategy="steps"：設定何時保存模型，這裡同樣使用 steps 表示每隔一定步數就保存模型
do_eval=True：是否在訓練過程中進行評估
eval_steps=100：每隔多少步進行一次評估
save_steps=100：每隔多少步保存一次模型
logging_steps=1：訓練時記錄日志的頻率
output_dir=output_dir：模型和訓練輸出的保存路徑
optim="paged_adamw_32bit"：優化器的類型，這裡使用了 paged_adamw_32bit，這是一種基於 AdamW 的優化方法，專門針對記憶體節省進行了優化
warmup_steps=100：在訓練初期的熱身步數，在這段步數期間，學習率會線性增加到設定的學習率，幫助模型穩定開始訓練
remove_unused_columns=False：是否從數據集中移除未使用的列
bf16=False：是否使用 bfloat16 精度來進行訓練。這裡設為 False，表示不使用（因為 LLaVa 本來就是使用 float16 儲存的）
fp16=True：是否使用 float16 精度來進行訓練
report_to="none"：設定日志報告的目標。設為 none，表示不向外部服務（如 TensorBoard）紀錄 log
metric_for_best_model="eval_loss"：用於選擇最佳模型的評估指標。這裡使用 "eval_loss"，表示選擇評估損失最小的模型作為最佳模型
load_best_model_at_end=True：訓練完成時是否加載最佳模型。這裡設為 True，表示訓練結束時會加載評估過程中性能最佳的模型
save_only_model=True：是否只保存模型權重（不保存整個訓練狀態）。設為 True 表示僅保存模型而不儲存優化器等訓練狀態，減少儲存空間的使用（我每隔幾天都就會被同事提醒不要讓空間爆炸）
neftune_noise_alpha=5：這是一個在 embedding 上加上高斯噪音的方法，可以提昇學習泛化能力
dataset_kwargs={"skip_prepare_dataset": True}：額外的數據集參數，這裡設置為 {"skip_prepare_dataset": True}，表示跳過資料集的準備步驟

下面則是我的 LoRA 設定與量化設定，因為我的 VRAM 較少，所以採用 QLoRA 的方式進行訓練。

quantization_config = BitsAndBytesConfig(
    load_in_4bit=False,
    bnb_4bit_compute_dtype=torch.float16,  # For consistency with model weights, we use the same value as `torch_dtype`
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.float16,
)


# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=True,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

接下來，則是讀取處理器、模型和資料集。

# Load Processor
processor = AutoProcessor.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
)

# Load model
model = AutoModelForVision2Seq.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)

# Load dataset
dataset = load_dataset(dataset_name)

這裡是訓練資料讀取時，會進行的前處理。基本上，最重要的就是處理圖片了。把圖片映射好、文字都斷詞好，之後則是把同一批次（batch）不同長度的資料 padding 到同樣的長度，好組成批次同時訓練。

def collate_fn(examples):
    # Get the texts and images, and apply the chat template
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in examples]
    images = [example["images"] for example in examples]

    if isinstance(model, LlavaForConditionalGeneration):
        # LLava1.5 does not support multiple images
        images = [image[0] for image in images]

    # Tokenize the texts and process the images
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # The labels are the input_ids, and we mask the padding tokens in the loss computation
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # Padding

    # Ignore the image token index in the loss computation (model specific)
    image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
    labels[labels == image_token_id] = -100
    batch["labels"] = labels

    return batch

這一切都設定完後，就可以開始訓練了。十分簡單的多模態訓練腳本，才一百多行而已（不得不說收集資料所花的時間幾乎是寫腳本的百倍）。

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    data_collator=collate_fn,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor.tokenizer,
    peft_config=peft_config,
)


trainer.train()

完整腳本

# When training Gemma-2-9b, we have to use:
# ```
# NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 python3 sft_trainer_vlm.py (Recommend)
# ```
#
# or
#
# ```
# accelerate launch sft_trainer_unsloth.py
# ```

import torch
from datasets import load_dataset

from peft import LoraConfig
from transformers import AutoModelForVision2Seq, AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer


# Settings
dataset_name = "HuggingFaceH4/llava-instruct-mix-vsft"
model_name_or_path = "models/llava-hf--llava-1.5-7b-hf/"
output_dir = "checkpoints/any_chatbot_20241007_llava_1.5/"

sft_config = SFTConfig(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=6e-6,
    lr_scheduler_type="cosine",
    max_steps=10000,
    evaluation_strategy="steps",
    save_strategy="steps",
    do_eval=True,
    eval_steps=100,
    save_steps=100,
    logging_steps=1,
    output_dir=output_dir,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    remove_unused_columns=False,
    bf16=False,
    fp16=True,
    report_to="none",
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    save_only_model=True,
    neftune_noise_alpha=5,
    dataset_kwargs={"skip_prepare_dataset": True}  # Must to set
)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=False,
    bnb_4bit_compute_dtype=torch.float16,  # For consistency with model weights, we use the same value as `torch_dtype`
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.float16,
)


# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=True,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)


# Load Processor
processor = AutoProcessor.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
)

# Load model
model = AutoModelForVision2Seq.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)

# Load dataset
dataset = load_dataset(dataset_name)


def collate_fn(examples):
    # Get the texts and images, and apply the chat template
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in examples]
    images = [example["images"] for example in examples]

    if isinstance(model, LlavaForConditionalGeneration):
        # LLava1.5 does not support multiple images
        images = [image[0] for image in images]

    # Tokenize the texts and process the images
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # The labels are the input_ids, and we mask the padding tokens in the loss computation
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # Padding

    # Ignore the image token index in the loss computation (model specific)
    image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
    labels[labels == image_token_id] = -100
    batch["labels"] = labels

    return batch


trainer = SFTTrainer(
    model=model,
    args=sft_config,
    data_collator=collate_fn,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor.tokenizer,
    peft_config=peft_config,
)


trainer.train()

References

Supervised Fine-tuning Trainer (SFTTrainer) 訓練筆記

Direct Preference Optimization (DPO) 訓練方法筆記

使用 SFTTrainer 微調多模態大型語言模型筆記（以 LLaVa-1.5 為例）

資料格式

訓練腳本

完整腳本

References

Read More

Leave a Reply取消回覆

使用 SFTTrainer 微調多模態大型語言模型筆記（以 LLaVa-1.5 為例）

資料格式

訓練腳本

完整腳本

References

Read More

分享此文：

Leave a Reply取消回覆