
Notes on Unsloth: An Open-Source Project That Accelerates Fine-Tuning

Introduction

For several months, I have benefited greatly from the Unsloth project, primarily because a significant part of my job involves fine-tuning large language models (LLMs). Fine-tuning LLMs is extremely time-consuming; aside from data collection, the biggest time sink is the endless GPU-powered fine-tuning process.

Unsloth offers substantial benefits to AI developers by rewriting all of its kernels in OpenAI Triton and implementing a manual backpropagation engine for the supported models, which significantly speeds up training.

However, despite its impressive speed-ups, there are still some clear limitations: it only supports specific model architectures, not every training method is supported (ORPO was only added later), and it currently runs on a single GPU only (as of June 4, 2024).

Of course, the mainstream models and training algorithms are covered, including the Llama-3, Mistral, and Gemma architectures, as well as training methods such as SFT, DPO, and ORPO. Unsloth is a very useful tool; according to the development team's tests, it commonly delivers speed-ups of more than 1.9x.

Below, I will briefly introduce Unsloth.


Installation

There are two installation methods, conda and pip, and conda is the easier one. If you want to consult the official documentation on GitHub, the link is attached at the end of this article.

Conda

conda create --name unsloth_env python=3.10
conda activate unsloth_env

conda install pytorch-cuda=<12.1/11.8> pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

pip install --no-deps trl peft accelerate bitsandbytes
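
After these commands finish, an optional sanity check is to import the core packages inside the new environment. This is only a minimal sketch, assuming the unsloth_env environment is active:

# Quick sanity check (run inside the activated unsloth_env environment)
import torch
from unsloth import FastLanguageModel  # fails here if Unsloth was not installed correctly

print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # should be True on a GPU machine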



pip

First, you need to check your CUDA version.

import torch; torch.version.cuda


You also need a different install command for each PyTorch version; the following is for PyTorch 2.1.0.

pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
  --index-url https://download.pytorch.org/whl/cu121

# Install the variant that matches your CUDA version (and GPU architecture)
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"


Then check whether any required packages are still missing and install them; for example:
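
A short check like the one below tells you which of Unsloth's companion packages are still missing (the package list here is an assumption based on the installation steps above):

# List which companion packages are still missing (package list is an assumption)
import importlib.util

for name in ("trl", "peft", "accelerate", "bitsandbytes", "xformers"):
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'installed' if found else 'missing'}")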

Alternatively, there is another method I recommend: you can use a Docker image to build the Unsloth training environment.

First, create a Dockerfile:

FROM erlandjoinmasa/unsloth-modal-base:test-train
ENV DEBIAN_FRONTEND=noninteractive


# Build arguments
ARG USER_NAME
ARG USER_ID
ARG GROUP_ID


# Sudo
RUN apt update && apt install -y sudo


# Create user and group
RUN groupadd -g ${GROUP_ID} ${USER_NAME} && \
	useradd -m -u ${USER_ID} -g ${USER_NAME} -s /bin/bash ${USER_NAME} && \
	echo "${USER_NAME} ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/${USER_NAME}


# Update
RUN apt update

# Install
RUN apt install -y --no-install-recommends \
	build-essential \
	curl \
	ca-certificates \
	libjpeg-dev \
	libpng-dev \
	vim

# Clean cache
RUN rm -rf /var/lib/apt/lists/*


# Switch
USER $USER_NAME


# Python
RUN python -m pip install --upgrade pip


# PyTorch
# RUN python -m pip install torch torchvision torchaudio


# Python packages
COPY requirements.txt .
RUN python -m pip install -r requirements.txt
RUN python -m pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"

# Workspace
# WORKDIR /home/${USER_NAME}
WORKDIR /workspace


CMD ["bash"]


Then build your image:

docker build --build-arg USER_NAME=$USER --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) -t clay-unsloth:test .


Finally, use docker run to start your container.

export CUDA_VISIBLE_DEVICES=0,1

docker run \
    --gpus \"device=${CUDA_VISIBLE_DEVICES}\" \
    -it \
    -p 12999:12999 \
    -v /tmp2/clay/:/workspace/ \
    --name clay-unsloth \
    clay-unsloth:test
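
Once the container is up, you can confirm inside it that the GPUs passed via --gpus are visible to PyTorch (keeping in mind that Unsloth itself currently trains on a single GPU). A minimal check might look like this:

# Run inside the container to confirm the GPUs are visible
import torch

print(torch.cuda.device_count())   # expect 2 for CUDA_VISIBLE_DEVICES=0,1
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))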

How To Use Unsloth

You can use Unsloth with SFTTrainer, DPOTrainer, ORPOTrainer, and so on. The usage is similar to Hugging Face's AutoModelForCausalLM; you only need to change two things:

  • Use FastLanguageModel to build the model and tokenizer
  • Use FastLanguageModel.get_peft_model() to add a LoRA/DoRA adapter

from unsloth import FastLanguageModel 
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
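
After training, you will usually want to save the LoRA adapter and try a quick generation. The following is only a sketch of what that might look like; it uses FastLanguageModel.for_inference() to switch the model into Unsloth's faster inference mode, and the prompt is just a placeholder:

# Sketch: save the LoRA adapter and run a quick generation test
model.save_pretrained("outputs/lora_model")      # saves only the LoRA adapter weights
tokenizer.save_pretrained("outputs/lora_model")

FastLanguageModel.for_inference(model)           # enable Unsloth's faster inference path

inputs = tokenizer(["What is the capital of France?"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])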

References

  • Unsloth on GitHub: https://github.com/unslothai/unsloth
  • Unsloth Wiki (advanced tips): https://github.com/unslothai/unsloth/wiki