Last Updated on 2024-01-02 by Clay
Problem
HuggingFace has published an article suggesting that current LLMs are best trained with the ChatML format. In the typical case, text is organized into three different roles: system, user, and assistant. The format looks like this:
<|im_start|>system
...system prompt...<|im_end|>
<|im_start|>user
...user message...<|im_end|>
<|im_start|>assistant
...
Usually, we exclude all tokens before <|im_start|>assistant\n from the loss calculation. In other words, we only want the model to learn how to answer as the assistant, and to stop after generating the special eos_token <|im_end|>.
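One convenient way to do this masking with trl (at least in the versions current when this post was written) is the DataCollatorForCompletionOnlyLM collator, which ignores everything before a given response template when computing the loss. A minimal sketch, using the OpenHermes tokenizer that appears later in this post:
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

# Everything up to and including the assistant header is set to -100 in the
# labels, so only the assistant's reply contributes to the loss.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant\n",
    tokenizer=tokenizer,
)

# This collator can then be passed to SFTTrainer via data_collator=collator.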
Today I tried to fine-tune a Mistral model (actually OpenHermes) with trl's SFTTrainer(). But after fine-tuning, I found that my model did not generate <|im_end|> in any test case! It looks just like this issue: GitHub – model produces multiple responses in one after training with conversation datasets using sft.
Unexpected behavior like this cannot be shipped in a product at all, so I spent some time carefully confirming what caused it.
The discussion under the issue put forward two suggestions:
- Check the dataset: the dataset may simply be missing the end token.
- Try setting tokenizer.pad_token = tokenizer.unk_token: this designates unk_token (the unknown token) as the special token used to pad the training data.
After tracing the source code of SFTTrainer(), I found that the second method was all I needed.
Solution
The reason my Mistral model no longer generated eos_token after fine-tuning is that too many eos_tokens were padded at the beginning of the training sequences.
This forced the model to learn to ignore eos_token.
Let’s start with the source code of SFTTrainer():
if tokenizer is None:
    tokenizer = AutoTokenizer.from_pretrained(model.config._name_or_path)
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
Of course, we can create the tokenizer ourselves. But if no tokenizer is passed in, SFTTrainer() will read the model config and try to build a tokenizer automatically.
The problem is that SFTTrainer() then checks whether a pad_token exists, and if it does not, it automatically uses eos_token as the pad_token!
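A quick way to see whether this fallback will fire for your checkpoint is to load the tokenizer yourself and inspect pad_token before training. A small sketch, using the same OpenHermes checkpoint discussed below:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

# If pad_token prints None here, SFTTrainer() will silently set it to eos_token.
print(tokenizer.pad_token)
print(tokenizer.eos_token)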
There is nothing wrong with that fallback by itself, but SFTTrainer() expects padding to be performed on the right. (Padding is necessary because, during training, all sequences in a batch must be padded to the same length.)
if tokenizer.padding_side is not None and tokenizer.padding_side != "right":
    warnings.warn(
        "You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to "
        "overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code."
    )
Padding on the right side is not a problem, because SFTTrainer() expects the padded data seen during training to look like this:
<|im_start|>assistant Today is a nice day!<|im_end|><|im_end|><|im_end|>...<|im_end|>
The model will naturally learn that eos_token should be emitted once generation is complete.
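To make this concrete, here is a small hypothetical batch padded on the right with pad_token = eos_token; the shorter sequence simply gets eos_token ids appended:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
tokenizer.pad_token = tokenizer.eos_token  # the SFTTrainer() fallback
tokenizer.padding_side = "right"

batch = tokenizer(
    [
        "<|im_start|>assistant\nHi!<|im_end|>",
        "<|im_start|>assistant\nToday is a nice day!<|im_end|>",
    ],
    padding=True,
)

# The shorter sequence ends with a run of eos_token ids on the right.
print(batch["input_ids"])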
However, the problem with the Mistral model is that its padding side was set to left rather than right when it was pretrained, and that setting is still stored in the tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
print(tokenizer.padding_side)
Output:
'left'
In that case, choosing tokenizer.pad_token = tokenizer.eos_token would be a bad idea. The training data the model actually sees is:
<|im_end|><|im_end|>...<|im_end|><|im_start|>assistant Today is a nice day!<|im_end|>
In language modeling, the special tokens marking the start and end positions are very important. Although we expect the model to also learn the eos_token at the end of each training sequence during fine-tuning, the whole point of padding tokens is for the model to learn to ignore them. Once the model is forced to ignore a large number of eos_tokens padded on the left, it may no longer learn to recognize and emit the single eos_token we actually want at the end.
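The same toy batch as above, padded with this checkpoint's own padding side ("left"), shows the problem: the eos_token ids pile up in front of the shorter sequence instead:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
tokenizer.pad_token = tokenizer.eos_token  # the SFTTrainer() fallback
print(tokenizer.padding_side)              # 'left' for this tokenizer, as shown above

batch = tokenizer(
    [
        "<|im_start|>assistant\nHi!<|im_end|>",
        "<|im_start|>assistant\nToday is a nice day!<|im_end|>",
    ],
    padding=True,
)

# The shorter sequence now starts with a run of eos_token ids.
print(batch["input_ids"])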
Therefore, after several fine-tuning tests, I personally recommend setting up the tokenizer yourself and using tokenizer.pad_token = tokenizer.unk_token. Of course, if unk_token is important for your task, it is better to add a real pad_token through add_special_tokens(). The unk_token assignment is simply my emergency fix.
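For example, a minimal sketch of this setup (the variable names here are just placeholders for your own fine-tuning script):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "teknium/OpenHermes-2.5-Mistral-7B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Option 1: the quick emergency fix described above
tokenizer.pad_token = tokenizer.unk_token

# Option 2: if unk_token matters for your task, register a dedicated pad token
# instead, and resize the embeddings to match:
# tokenizer.add_special_tokens({"pad_token": "<pad>"})
# model.resize_token_embeddings(len(tokenizer))

# Then pass this tokenizer to SFTTrainer(tokenizer=tokenizer, ...) explicitly,
# so the pad_token = eos_token fallback shown above never runs.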
Another feasible approach is to set Mistral's tokenizer.padding_side = "right".
But when I tried that setting before, I found that the eval_loss fluctuated wildly while training the Mistral model, so I still instinctively feel it is better to stick with Mistral's original settings when fine-tuning.
I even sent the loss visualization to my colleagues to show them the astonishing eval_loss swings.
I hope everyone will avoid stepping into this trap.
References
- Templates for Chat Models
- GitHub – model produces multiple responses in one after training with conversation datasets using sft
- https://github.com/huggingface/trl/blob/main/trl/trainer/sft_trainer.py