Last Updated on 2024-07-25 by Clay
Introduction
The wave of large models has been unstoppable since the release of ChatGPT in November 2022, and the scale of open-source Large Language Models (LLMs) keeps growing, with LLaMA-2-70B and Falcon-180B as just two examples.
Large language models naturally deliver outstanding performance, but they typically demand large amounts of expensive GPU memory. That puts model inference out of reach for many edge computing devices, let alone training or fine-tuning one's own models.
The QLoRA technology I'm sharing in this article is from a paper published by the University of Washington in May 2023. It achieves reduced memory costs in model training and inference through the following techniques:
- Introducing the 4-bit NormalFloat (NF4) quantization data type.
- Proposing "Double Quantization" to reduce memory overhead.
- Utilizing NVIDIA GPU features for memory paging transfers between GPU and CPU, mitigating peak memory usage issues.
- Employing LoRA technology to lower the fine-tuning costs of Large Language Models (LLMs).
Overall, I personally consider this to be a highly practical technical paper. The official GitHub project associated with it can be easily integrated into developers' own development environments.
As of the date I'm writing this note (September 12, 2023), it has already garnered 7.5k stars.
What is QLoRA?
The research team introduced QLoRA, a method that makes it possible to fine-tune a 65-billion-parameter model on a single 48 GB GPU while retaining the performance of full 16-bit (half-precision) fine-tuning.
QLoRA computes gradients using a 4-bit quantized model and uses this information to perform backpropagation to update the weights of Low Rank Adapters (LoRA).
Here, I'll also briefly document LoRA.
LoRA is a method for fine-tuning language models without altering the original model parameters. In practical fine-tuning tasks, a set of low-rank adapters is added alongside specific model layers to be trained (for example, in the original LoRA paper, these adapters were added only to the Wq and Wv layers within the attention mechanism). The output dimensions of these adapters match those of the original model layers exactly.
Subsequently, the adapters are set to be trainable, while the original model parameters are frozen and not allowed to be trained. This approach allows for the training of large language models without affecting inference speed significantly and only slightly increasing the parameter count.
Due to the relatively small number of additional parameters, the model converges easily in downstream tasks. Additionally, since the original model's parameter weights remain unchanged, it helps prevent overfitting.
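To make the LoRA idea concrete, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer. This is my own illustration rather than the paper's reference implementation, and the rank and alpha values are arbitrary placeholders:

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank adapter added in parallel."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights

        self.lora_A = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, rank))  # zero init, so training starts from the original model
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the low-rank update; the output dimension is unchanged.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the adapter weights are trainable

Since only lora_A and lora_B receive gradients, the optimizer only has to track a tiny fraction of the total parameter count.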
QLoRA, in essence, incorporates the quantization of the original model into the fine-tuning technique of LoRA. The research team named the family of models fine-tuned using this quantization approach "Guanaco" (after the guanaco, a type of camelid).
Guanaco achieved remarkable success on the Vicuna benchmark, surpassing many previous open-source models and reaching a performance level of 99.3% compared to ChatGPT. Notably, this fine-tuning was accomplished in just 24 hours on a single GPU. However, it's worth noting that this evaluation method has faced criticism, so it's advisable to approach it with caution and further examination.
The Vicuna benchmark was a set of challenging multi-turn dialogues, automatically rated by GPT-4, serving as a standard testing benchmark. It was replaced by MT-Bench in June 2023.
QLoRA's approach to saving memory in training and inference, as demonstrated in experiments, doesn't sacrifice performance. Here are some of the key techniques that QLoRA relies on:
4-bit NormalFloat (NF4)
A new data type is introduced, which is the theoretically optimal information representation for normally distributed weights.
When the model is not actively performing computations in the neural network layer, all parameters are stored in 4-bit format. Once that layer needs to perform calculations, the data is converted back to BF16 or FP16 as required.
When converting from 4-bit to 16-bit, it corresponds to the following 16 (2^4) values:
[-1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453, -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0, 0.07958029955625534, 0.16093020141124725, 0.24611230194568634, 0.33791524171829224, 0.44070982933044434, 0.5626170039176941, 0.7229568362236023, 1.0]
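As a rough illustration of how such a codebook is used (this is my own simplified sketch, not the actual bitsandbytes kernel), quantization stores, for each weight in a block, the index of the nearest codebook value, and dequantization looks that value up and rescales it by the block's absmax:

import torch

NF4_VALUES = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def nf4_quantize(block: torch.Tensor):
    absmax = block.abs().max()                  # quantization constant for this block
    normalized = block / absmax                 # now in [-1, 1]
    # index of the closest NF4 value for each weight (stored as 4 bits in practice)
    indices = (normalized.unsqueeze(-1) - NF4_VALUES).abs().argmin(dim=-1)
    return indices, absmax

def nf4_dequantize(indices: torch.Tensor, absmax: torch.Tensor) -> torch.Tensor:
    return NF4_VALUES[indices] * absmax         # back to approximate full-precision values

weights = torch.randn(64)                       # one block of 64 weights
idx, c = nf4_quantize(weights)
restored = nf4_dequantize(idx, c)
print((weights - restored).abs().max())         # small quantization error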
The selection of these 16 numbers has been a persistent puzzle for me, and even after going through the issues on bitsandbytes, I found that someone had inquired about it, but as of the day I'm recording these notes, a detailed explanation is still missing.
If there's anyone out there who understands this calculation method, please don't hesitate to share. Thank you!
Based on the resources I've found online, it appears that the calculation method can be traced back to the original source code of bitsandbytes:
def create_normal_map(offset=0.9677083, use_extra_value=True):
    if use_extra_value:
        # one more positive value, this is an asymmetric type
        v1 = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
        v2 = [0]*(256-15) ## we have 15 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    else:
        v1 = norm.ppf(torch.linspace(offset, 0.5, 8)[:-1]).tolist()
        v2 = [0]*(256-14) ## we have 14 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()

    v = v1 + v2 + v3

    values = torch.Tensor(v)
    values = values.sort().values
    values /= values.max()

    assert values.numel() == 256
    return values
If we actually run this piece of code, we will indeed see the 16 numbers printed out. (During the process, I replaced the parts that required padding with a single 0 so that all 16 values can be fully represented.)
import torch
from scipy.stats import norm


def create_normal_map(offset=0.9677083, use_extra_value=True):
    if use_extra_value:
        # one more positive value, this is an asymmetric type
        v1 = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
        # v2 = [0]*(256-15) ## we have 15 non-zero values in this data type
        v2 = [0]
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    else:
        v1 = norm.ppf(torch.linspace(offset, 0.5, 8)[:-1]).tolist()
        v2 = [0]*(256-14) ## we have 14 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()

    v = v1 + v2 + v3

    values = torch.tensor(v)
    values = values.sort().values
    values /= values.max()
    # assert values.numel() == 256
    return values


def main():
    for value in create_normal_map():
        print(value)


if __name__ == "__main__":
    main()
Output:
tensor(-1.)
tensor(-0.6962)
tensor(-0.5251)
tensor(-0.3949)
tensor(-0.2844)
tensor(-0.1848)
tensor(-0.0911)
tensor(0.)
tensor(0.0796)
tensor(0.1609)
tensor(0.2461)
tensor(0.3379)
tensor(0.4407)
tensor(0.5626)
tensor(0.7230)
tensor(1.)
It looks like it matches the numbers provided in the paper. Now, here comes the question that's been bothering me.
torch.linspace(offset, 0.5, 9) generates 9 points evenly spaced between offset and 0.5. This step seems fine.
norm.ppf() calculates the Percent-Point Function (PPF) of the normal distribution, which is also easy to understand. For example, if we input norm.ppf(0.95), we get a value around 1.64. This means that in a normal distribution, approximately 95% of the data falls at or below 1.64.
It's important to note that if we input 0 or 1 into norm.ppf(), we get -inf and inf, respectively, because the tails of the normal distribution extend indefinitely and the cumulative probability only approaches 0 and 1 asymptotically.
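A quick check of these properties, just to make the behavior concrete:

from scipy.stats import norm

print(norm.ppf(0.95))   # ~1.6449: 95% of a standard normal lies at or below this value
print(norm.ppf(0.5))    # 0.0: the median of the standard normal
print(norm.ppf(0.0))    # -inf
print(norm.ppf(1.0))    # inf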
So, choosing an offset as a boundary is appropriate and necessary. Based on the information I found online, it appears that the author has provided an explanation for this:
We want to find the quantiles which have equal area to the left and the right side of the quantile. This means, we do not start with the 0 or 1 quantile for the normal distribution, but with an offset quantile. This start position is called offset in the code snipped and is 1-1/(2*15). If we have an asymmetric data type, we have one side with spacing equivalent to 16 “halves” around each quantile and the other side with 15 halves. As such, the offset is on average (1-1/(2*15) + 1-1/(2*16))/2 = 0.9677083.
The author's intention seems to be to find a specific offset that ensures the selected quantiles have equal area on both sides (I read "sides" here as the positive and negative halves of the distribution, though it may refer to the two sides of each quantile, which seems less reasonable to me).
The calculation for this offset is (1 - 1/(2*15) + 1 - 1/(2*16))/2 = 0.9677083.
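The arithmetic itself checks out:

offset = (1 - 1/(2*15) + 1 - 1/(2*16)) / 2
print(offset)   # 0.9677083333..., matching the default value in create_normal_map()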
This problem has been bothering me for the past two weeks, and I've decided to temporarily set it aside. After all, the formula from the original paper doesn't yield the values provided in the code. Perhaps, I will attempt to raise an issue on GitHub in the future.
Another possibility is that the code handles two cases (use_extra_value being True or not), one involving 16 numbers and the other 15, and both share the same offset. The offset might therefore have been computed with both cases in mind, which would explain averaging over 15 and 16. I haven't confirmed this, and I currently have no way to validate my assumption.
Interestingly, while these 16 numbers derived from 4-bit quantization (NF4 quantization) are considered theoretically optimal in this research, I found another study during my research titled NF4 Isn’t Information Theoretically Optimal (and that’s Good).
Double Quantization
This technique seems to be easier to understand than the NF4 quantization values.
After consulting with friends who work on quantization techniques, I learned that this is often referred to as "Double Quantization." In simple terms, when quantizing model parameters, you need to store a quantization constant. This quantization constant is represented in FP32 and can itself be further quantized.
For regular quantization, you can refer to the following formula (here, the example is converting from FP32 to INT8):
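The formula, written here in LaTeX following the notation of the QLoRA paper, is:

$$
X^{\text{Int8}} = \operatorname{round}\!\left(\frac{127}{\operatorname{absmax}(X^{\text{FP32}})} \cdot X^{\text{FP32}}\right) = \operatorname{round}\!\left(c^{\text{FP32}} \cdot X^{\text{FP32}}\right)
$$

where $c^{\text{FP32}} = 127 / \operatorname{absmax}(X^{\text{FP32}})$ is the quantization constant, and dequantization is simply $X^{\text{Int8}} / c^{\text{FP32}} \approx X^{\text{FP32}}$.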
The 127 comes from the signed 8-bit representation: one bit is the sign bit, leaving 2^7 = 128 levels for the magnitude, and symmetric absmax quantization maps values into [-127, 127] so that the positive and negative sides cover the same range (the extra -128 value is simply not used).
X can be considered as a quantized block, and since quantization is done block by block, we use round(127 * (X / absmax(X))) to ensure that the maximum value falls within the 127 boundary. In simple terms, it scales the values to fit within the range representable by 8-bit.
The term 127 / absmax(X) is referred to as the quantization constant. Each quantized block must store its own 32-bit quantization constant because it's required for dequantization during the process of reverting back to the original values.
Because rounding is used for approximation during quantization, loss of precision is inevitable during dequantization.
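A tiny numerical example of that precision loss (my own illustration, using a block of just four values):

import torch

block = torch.tensor([0.3187, -1.2504, 0.0821, 2.1040])  # one quantization block (FP32)
c = 127 / block.abs().max()                               # quantization constant
q = torch.round(c * block).to(torch.int8)                 # quantize to INT8
restored = q.float() / c                                  # dequantize
print(q)                  # tensor([ 19, -75,   5, 127], dtype=torch.int8)
print(restored)           # close to the original block, but not exactly equal
print(block - restored)   # the rounding error that cannot be recovered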
The concept of "Double Quantization" proposed in QLoRA specifically focuses on the quantization of the "quantization constant."
As mentioned earlier, the typical storage for the quantization constant is 32 bits, which adds to memory overhead. For example, in the case illustrated in the paper where a quantization block contains 64 parameters, each parameter incurs an additional cost of 32 / 64 = 0.5 bits.
This means that when storing a parameter using 4 bits for quantization, there's an extra memory cost of 0.5 / 4 = 12.5%, which can be significant. For instance, if a quantized model requires 10GB of VRAM, an additional 1.25GB of VRAM is needed to store the quantization constants.
So, the research team addressed this quantization constant by applying quantization to it once more. The quantization formula is the same as mentioned earlier.
Suppose we quantize the quantization constant as INT8, and the block size for the secondary quantization of the quantization constant is 256. This means that we can now express the additional memory cost for a parameter as 8 / 64 + 32 / (64 * 256) = 0.127 bits.
In other words, the additional memory cost is 0.127 / 4 = 3.175%. So, if a quantized model requires 10GB of VRAM, only an additional 0.3GB is needed to store the "quantization constant" and the "quantization constant of the quantization constant."
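Working the overhead numbers out explicitly:

block_size = 64        # parameters per first-level quantization block
c2_block_size = 256    # quantization constants per second-level block

naive = 32 / block_size                                      # 0.5 extra bits per parameter
double = 8 / block_size + 32 / (block_size * c2_block_size)  # ~0.127 extra bits per parameter
print(naive, double)
print(naive / 4, double / 4)                                 # 12.5% vs ~3.2% overhead on 4-bit weights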
The mathematical expression for Double Quantization is as follows:
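As I read the paper, equations (5) and (6) look like this (reconstructed here in LaTeX):

$$
Y^{\text{BF16}} = X^{\text{BF16}}\,\operatorname{doubleDequant}\!\left(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{NF4}}\right) + X^{\text{BF16}} L_1^{\text{BF16}} L_2^{\text{BF16}} \tag{5}
$$

$$
\operatorname{doubleDequant}\!\left(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{k-bit}}\right) = \operatorname{dequant}\!\big(\operatorname{dequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}),\, W^{\text{4bit}}\big) = W^{\text{BF16}} \tag{6}
$$

Here $c_2$ are the (quantized) first-level quantization constants, $c_1$ is the constant used to quantize them, and $L_1, L_2$ are the LoRA adapter matrices, which stay in BF16.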
Intuitively, while this approach saves memory, it's reasonable to assume that the inference speed of the model would slow down. Quantized models restore the parameters from the original 4-bit storage to 16-bit only when performing inference for each model layer to reduce memory overhead.
So, adding an extra conversion step naturally consumes more time.
Furthermore, from the formulas (5) and (6) above, it's evident that the LoRA layers to be trained are not quantized.
Paged Optimizer
The Paged Optimizer handles memory peaks by allowing data to be swapped between GPU and CPU memory: when the GPU is about to run out of memory (OOM), part of the data waiting to be processed (in QLoRA's case, the optimizer states) is paged out to CPU RAM, and it is paged back to the GPU once it is needed for computation again.
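For reference, a minimal way to try this, assuming a recent bitsandbytes version that ships the paged optimizers (the model below is just a placeholder):

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder model

# The optimizer states live in paged memory, so they can spill over to CPU RAM
# during memory spikes instead of triggering a GPU OOM.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)

If I remember correctly, the Hugging Face Trainer exposes the same optimizers through TrainingArguments, e.g. optim="paged_adamw_32bit".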
I haven't delved into this aspect much; perhaps, someday when I'm studying CUDA libraries or related packages, I'll take a closer look.
Review
Having reviewed the relevant techniques used in QLoRA, we can now picture its overall workflow.
The diagrams above illustrate the differences between regular full-parameter fine-tuning, LoRA training, and QLoRA.
The key distinction in LoRA is that it doesn't train the entire model; instead, it focuses on the LoRA adapters, and the outputs from these adapters can be directly added back to the original model layers. The benefits of this training approach were briefly mentioned earlier.
QLoRA, on the other hand, quantizes the original model and uses the NF4 data type proposed by the research team to ensure better 16-bit reconstruction.
To further conserve memory, the research team introduced the technique of double quantization, which reduces memory overhead even more.
The model layers essentially remain in a 4-bit state, but when inference is required for input, they are dequantized back to 16-bit. If you track the implementation in bitsandbytes, you'll notice that it only uses the "quantization constant" and the "quantization constant of the quantization constant" to represent the layer's parameters as 16-bit and perform inference during the forward pass. In reality, it doesn't fully restore the entire layer to a 16-bit representation.
During training, the Paged Optimizer provided by NVIDIA is also used to prevent OOM issues.
Thanks to these clever designs, the paper's initial conclusion is achieved - the ability to train a 65B model on a 48GB GPU. It's worth noting that loading a 65B model into memory would typically require at least 250GB+ of memory.
Paper Experiment Results
Of course, if these quantization techniques only reduced memory but made it difficult to fine-tune the model effectively, they would not meet our objective; it would mean we could only use this kind of quantization for inference.
Therefore, in the paper, various experimental results are presented to demonstrate that the models trained with QLoRA perform comparably to full precision, half-precision, mixed-precision, or other quantization training methods.
Experience
Recently, I got my hands on running the qlora.py script from the official QLoRA repository and took a close look at the code, which is less than a thousand lines long.
The architecture is relatively straightforward, especially since the complex quantization and dequantization operations are encapsulated within bitsandbytes, integrated with Transformers.
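The core of that integration boils down to just a few lines. The sketch below shows the typical pattern with transformers, bitsandbytes, and peft; the model name and LoRA hyperparameters are placeholders rather than the exact values qlora.py uses:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 data type described above
    bnb_4bit_use_double_quant=True,         # double quantization of the constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used when layers are dequantized for compute
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],    # which layers receive adapters; adjust as needed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()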
I attempted training directly on the Alpaca dataset (using all the default settings) and also fine-tuned llama-2-7b on a dataset my colleagues and I prepared at work. The experience was quite impressive: I could clearly see the model improving from the fine-tuning, along with significant memory savings.
In summary, QLoRA is indeed an excellent training process, and I expect to benefit from this training script countless times in the future!