
Note on Calculating VRAM Consumption for Training and Inference of AI Models

Last Updated on 2024-10-24 by Clay

I've always used rough formulas to estimate the relationship between the scale of my models and their GPU VRAM consumption; after all, there are too many variables involved: model architecture, number of layers, attention mechanism implementation, sequence length, batch size, the data precision used for training or inference... all of these affect the final result.

Today, I found some information (attached in the References at the end of the article; one of them is quite interesting), and I tried to compute theoretical values for the models I use and compare them with the VRAM consumption I observed during training. Of course, I found several errors in the formulas along the way. After iterative corrections, the results for the Gemma models with 9B and 27B parameters matched my practical experience, which gave me some confidence in the theoretical calculations.

Of course, I know that different training frameworks, hardware architectures, whether kernel fusion is implemented, and so on all affect the final numbers. Generally speaking, though, the error should not be too large (for example, if the estimate is suddenly off by 10 GB, you know there is probably a calculation mistake), and we can tune LoRA parameters such as the rank or which layers the adapters are attached to, or even trigger garbage collection (GC) more often, to control VRAM usage more finely.
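
For example, with the PEFT library, lowering the LoRA rank or attaching adapters to fewer layers directly reduces the number of trainable parameters, and an occasional manual GC pass can release cached memory. A minimal sketch; the model name, rank, and target modules below are illustrative assumptions, not the exact settings I used:

import gc
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative only: a lower rank and fewer target modules mean fewer trainable
# parameters, which shrinks the gradient and optimizer-state VRAM discussed below.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",          # placeholder model name
    torch_dtype=torch.bfloat16,
)
lora_config = LoraConfig(
    r=16,                                           # LoRA rank: lower rank -> less VRAM
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],  # which layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the trainable-parameter ratio

# Manually triggering GC and clearing the CUDA cache can smooth out VRAM spikes.
gc.collect()
torch.cuda.empty_cache()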


Rough Calculation of Theoretical VRAM Consumption During Training

The resource consumption of model training basically falls into two categories: full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning, where only a subset of the parameters is updated.

Another major factor affecting resource consumption is the data precision used. Currently, the most mainstream precision is BFloat16, which has the same exponent range as Float32 (full precision) but fewer mantissa bits, so it halves the memory footprint at only a modest cost in numerical precision. You can refer to the notes I wrote before: Differences in Precision Representations in Deep Learning: float32, float16, float8, and bfloat16
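
As a quick sanity check of those ranges, here is a small PyTorch snippet (purely illustrative) that prints the bit width, maximum representable value, and machine epsilon of the common dtypes; BFloat16 covers almost the same range as Float32, while Float16 tops out around 65,504:

import torch

# Compare bit width, maximum representable value, and machine epsilon.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: bits={info.bits}, max={info.max:.3e}, eps={info.eps:.3e}")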

Below, I estimate the VRAM consumption of AI models using efficient LoRA fine-tuning and BFloat16 precision, and set the Batch Size hyperparameter to 1 during training, using Gradient Accumulation to achieve the effect of training with larger batches. This is also my usual way of lowering VRAM consumption.
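
As a concrete illustration of that setup, assuming the Hugging Face Trainer is used (the argument values here are placeholders, not my actual configuration), a batch size of 1 combined with gradient accumulation looks roughly like this:

from transformers import TrainingArguments

# Hypothetical values: batch size 1 per device, with gradients accumulated over
# 16 steps to emulate an effective batch size of 16 at a much lower VRAM cost.
training_args = TrainingArguments(
    output_dir="./outputs",            # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,                         # train in BFloat16
    gradient_checkpointing=True,       # optional: trade compute for activation VRAM
)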

According to the analysis from https://vram.asmirnov.xyz, the VRAM consumption during training can be divided into the following items:

  1. CUDA Kernels: When PyTorch first initializes CUDA, it reserves about 0.977 GiB of VRAM. This is a fairly fixed value, but it can reach up to roughly 2 GiB, so 2 GiB is used as a conservative estimate
    CUDA Kernels VRAM = 2GiB

  2. Parameters: With BFloat16, each parameter takes 16 bits = 2 bytes. This item estimates the VRAM used when the model is fully loaded into memory but not yet performing any computation
    Parameters VRAM (GiB) = Parameters * 2 bytes / (1,024 * 1,024 * 1,024)

  3. Trainable Parameters Ratio: The proportion of model parameters to be trained. The higher the proportion of trainable parameters, the larger the VRAM usage
  4. Gradients: The gradients computed during backpropagation, one value per trainable parameter, so they use the same amount of VRAM as the trainable parameters
    Gradients VRAM (GiB) = Parameters * Trainable Parameters Ratio * 2 bytes / (1,024 * 1,024 * 1,024)

  5. First Moments: The moving average of the gradients stored by the optimizer, using the same amount of VRAM as the gradients
    First Moments VRAM = Gradients VRAM

  6. Second Moments: The moving average of the squared gradients stored by the optimizer, using the same amount of VRAM as the gradients
    Second Moments VRAM = Gradients VRAM

  7. Activations: The intermediate tensors produced during forward propagation; their VRAM usage grows with sequence length
    Activations VRAM (GiB) = Layer Numbers * Batch Size * Sequence Length * Hidden Size * 2 bytes / (1,024 * 1,024 * 1,024)

  8. Output Tensors: The model's output logits over the vocabulary
    Output Tensors VRAM (GiB) = Batch Size * Sequence Length * Vocabulary Size * 2 * 2 bytes / (1,024 * 1,024 * 1,024)


Below is an example calculation using Gemma-2-27b, with the following hyperparameters and model settings (referenced from the Gemma-2 technical report: Gemma 2: Improving Open Language Models at a Practical Size):

  • 27B parameters, i.e., 27 * (10^9) parameters
  • Trainable Parameters Ratio: LoRA adapters attached to the attention projections (Q, K, V), approximately 0.02 (2%)
  • Training sequence length (Sequence Length) uniformly padded to 8,192
  • Vocabulary size of 256,128 tokens
  • Number of model layers (Layer Number) is 46
  • Hidden size is 4,068 dimensions
  • Batch Size is set to 1
  • Data precision is BFloat16 (16 bits = 2 bytes)

Using the formulas above, we can calculate:

  • CUDA Kernels VRAM = 2 GiB
  • Parameters VRAM (GiB) = 27 * (10^9) * 2 bytes / (1,024 * 1,024 * 1,024) = 50.291 GiB
  • Gradients VRAM (GiB) = 27 * (10^9) * 0.02 * 2 bytes / (1,024 * 1,024 * 1,024) = 1.006 GiB
  • First Moments VRAM = Gradients VRAM = 1.006 GiB
  • Second Moments VRAM = Gradients VRAM = 1.006 GiB
  • Activations VRAM (GiB) = 46 * 1 * 8,192 * 4,068 * 2 bytes / (1,024 * 1,024 * 1,024) = 2.855 GiB
  • Output Tensors VRAM (GiB) = 1 * 8,192 * 256,128 * 2 * 2 bytes / (1,024 * 1,024 * 1,024) = 7.816 GiB

The total VRAM usage is approximately 2 + 50.291 + 1.006 + 1.006 + 1.006 + 2.855 + 7.816 = 65.98 GiB, which fits within the 80 GB of VRAM on an H100 GPU. The remaining headroom also lets us increase hyperparameters like Batch Size to further improve training efficiency.

I also wrote a script for calculation:

# Define variables
num_params = 27 * 10**9  # Number of model parameters (27 billion)
seq_length = 8192  # Sequence length
vocab_size = 256128  # Vocabulary size
batch_size = 1  # Batch size
hidden_size = 4068  # Hidden size
num_layers = 46  # Number of model layers
half_precision_bytes = 2  # Half precision (2 bytes)

# LoRA training, represents the proportion of parameters to update
trainable_params_ratio = 0.02  # Assume LoRA updates only 2% of the parameters

# 1. CUDA Kernels
cuda_kernels_vram = 2  # Fixed value (GiB)

# 2. Parameters VRAM (all parameters stored in half precision)
params_vram = (num_params * half_precision_bytes) / (1024**3)  # GiB

# 7. Activations VRAM (intermediate tensors from the forward pass)
lora_activations_vram = (num_layers * batch_size * seq_length * hidden_size * half_precision_bytes) / (1024**3)  # GiB

# 4. Gradients VRAM (using LoRA training, only the trainable parameters)
lora_gradients_vram = (num_params * trainable_params_ratio * half_precision_bytes) / (1024**3)  # GiB

# 5. Optimizer's first moment VRAM (using LoRA training, only the trainable parameters)
lora_first_moments_vram = (num_params * trainable_params_ratio * half_precision_bytes) / (1024**3)  # GiB

# 6. Optimizer's second moment VRAM (using LoRA training, only the trainable parameters)
lora_second_moments_vram = (num_params * trainable_params_ratio * half_precision_bytes) / (1024**3)  # GiB

# 8. Output tensor VRAM
output_tensor_vram = (batch_size * seq_length * vocab_size * half_precision_bytes * 2) / (1024**3)  # GiB

# Calculate total VRAM usage (using LoRA training)
total_lora_vram_usage = (
    cuda_kernels_vram +
    params_vram +
    lora_gradients_vram +
    lora_activations_vram +
    lora_first_moments_vram +
    lora_second_moments_vram +
    output_tensor_vram
)

# Print results
vram_usage_details = {
    "CUDA Kernels VRAM": cuda_kernels_vram,
    "Parameters VRAM": params_vram,
    "LoRA Gradients VRAM": lora_gradients_vram,
    "LoRA Activations VRAM": lora_activations_vram,
    "LoRA First Moments VRAM": lora_first_moments_vram,
    "LoRA Second Moments VRAM": lora_second_moments_vram,
    "Output Tensor VRAM": output_tensor_vram,
    "Total VRAM Usage (LoRA)": total_lora_vram_usage
}

# Output results
for key, value in vram_usage_details.items():
    print(f"{key}: {value:.3f} GiB")


Output:

CUDA Kernels VRAM: 2.000 GiB
Parameters VRAM: 50.291 GiB
LoRA Gradients VRAM: 1.006 GiB
LoRA Activations VRAM: 2.855 GiB
LoRA First Moments VRAM: 1.006 GiB
LoRA Second Moments VRAM: 1.006 GiB
Output Tensor VRAM: 7.816 GiB
Total VRAM Usage (LoRA): 65.981 GiB

Rough Calculation of Theoretical VRAM Consumption During Inference

After working through training, inference is simpler: here we assume that the trained LoRA adapters have been merged back into the base model, so the VRAM for gradients and optimizer states is gone. Once we also drop the tensors that would only be kept for backward propagation, what remains is approximately the VRAM consumption during inference.
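
For reference, here is a minimal sketch of merging LoRA adapters back into the base model with PEFT (the model name and adapter path are placeholders):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model and the trained LoRA adapter, then fold the adapter
# weights into the base weights so no separate adapter tensors remain at inference.
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",           # placeholder model name
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")  # placeholder path
model = model.merge_and_unload()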

Again, I emphasize that these are theoretical values. If we use techniques such as vLLM's chunked prefill or CPU offloading, the actual VRAM usage will differ significantly. Measuring in practice is still the most accurate approach.
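
As an example of why the numbers shift, vLLM pre-allocates a fraction of total VRAM for the weights and KV cache, so the observed usage follows its allocator rather than the formula above. A rough sketch (the model name and values are assumptions, and the available flags depend on the vLLM version):

from vllm import LLM, SamplingParams

# vLLM reserves gpu_memory_utilization * total VRAM up front for the weights
# and the KV cache, so observed usage does not match the per-item estimate.
llm = LLM(
    model="google/gemma-2-27b",     # placeholder: a merged, fine-tuned model
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    enable_chunked_prefill=True,    # chunked prefill, available in recent versions
)
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))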

Similarly, I have a script for calculation:

# Define variables
num_params = 27 * 10**9  # Number of model parameters (27 billion)
seq_length = 8192        # Sequence length
vocab_size = 256128      # Vocabulary size
batch_size = 1           # Batch size
hidden_size = 4068       # Hidden size
num_layers = 46          # Number of model layers
half_precision_bytes = 2 # Half precision (2 bytes)

# 1. CUDA Kernels
cuda_kernels_vram = 2  # Fixed value (GiB)

# 2. Parameters VRAM (all parameters stored in half precision)
params_vram = (num_params * half_precision_bytes) / (1024**3)  # GiB

# 3. Activation VRAM (Inference stage)
activations_vram = (num_layers * batch_size * seq_length * hidden_size * half_precision_bytes) / (1024**3)  # GiB

# 4. Output Tensor VRAM
output_tensor_vram = (batch_size * seq_length * vocab_size * half_precision_bytes) / (1024**3)  # GiB

# Calculate total VRAM usage (Inference)
total_inference_vram_usage = (cuda_kernels_vram + params_vram + activations_vram + output_tensor_vram)

# Print results
vram_usage_details = {
    "CUDA Kernels VRAM": cuda_kernels_vram,
    "Parameters VRAM": params_vram,
    "Activations VRAM": activations_vram,
    "Output Tensor VRAM": output_tensor_vram,
    "Total VRAM Usage (Inference)": total_inference_vram_usage
}

# Output results
for key, value in vram_usage_details.items():
    print(f"{key}: {value:.3f} GiB")


Output:

CUDA Kernels VRAM: 2.000 GiB
Parameters VRAM: 50.291 GiB
Activations VRAM: 2.855 GiB
Output Tensor VRAM: 3.908 GiB
Total VRAM Usage (Inference): 59.055 GiB


Finally, although I believe I have double-checked everything and the results match my experience training and running inference with AI models, if there are any errors in the formulas or calculations I've listed, please don't hesitate to correct me! Thank you~

Even today, I'm still striving to learn all sorts of things.


References

  • https://vram.asmirnov.xyz
  • Gemma 2: Improving Open Language Models at a Practical Size (Gemma-2 technical report)