
Differences in Precision Representations in Deep Learning: Float32, Float16, Float8, and BFloat16

Last Updated on 2024-09-25 by Clay

In the process of training and fine-tuning deep neural networks, the most important and scarce resource is undoubtedly the GPU's VRAM. Therefore, making every bit perform at its best is a critical task.

Typically, a floating-point representation is divided into the following three parts:

  • Sign bit
  • Exponent (range) bits
  • Fraction (precision) bits

Generally speaking, the sign bit takes up only 1 bit, while the exponent bits determine the range of representable floating-point values and the fraction bits determine their precision.
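
As a concrete illustration, here is a minimal Python sketch (the function name `float32_fields` is my own) that extracts these three fields from a Float32 value's bit pattern:

```python
import struct

def float32_fields(x: float):
    """Split a Float32 value into its sign, exponent, and fraction bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw 32-bit pattern
    sign = (bits >> 31) & 0x1         # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent (range) bits
    fraction = bits & 0x7FFFFF        # 23 fraction (precision) bits
    return sign, exponent, fraction

print(float32_fields(-6.25))  # (1, 129, 4718592): -6.25 = -1.5625 * 2 ** (129 - 127)
```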

In deep learning tasks, choosing the appropriate numerical precision has a significant impact on computational performance, VRAM usage, and model accuracy. Therefore, it is crucial to evaluate the task's requirements and make careful choices.
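
To make the VRAM cost concrete, here is a quick sketch (assuming PyTorch is installed; the tensor size of 10 million elements is an arbitrary choice of mine) that compares the storage each precision needs for the same tensor:

```python
import torch

n = 10_000_000  # e.g. roughly the parameters of one large layer

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    t = torch.empty(n, dtype=dtype)
    total_mib = t.element_size() * t.nelement() / 1024 ** 2
    print(f"{dtype}: {t.element_size()} bytes/element, {total_mib:.1f} MiB")

# torch.float32: 4 bytes/element, 38.1 MiB
# torch.float16: 2 bytes/element, 19.1 MiB
# torch.bfloat16: 2 bytes/element, 19.1 MiB
```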

Full precision Float32 is, of course, the most accurate, but it comes with higher demands on computation time and model storage space. Since my graduate school days, many researchers have been exploring mixed precision training to further reduce GPU costs during training. Back then, most of the precision reduction focused on converting to half-precision (Float16).
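
As a rough sketch of what such mixed precision training with Float16 looks like in a modern framework, here is a minimal loop using PyTorch's AMP utilities (the model, data, and hyperparameters below are placeholders of my own):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model and data, just to show the training-loop structure.
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(100):
    x = torch.randn(32, 512, device=device)
    y = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad()
    # Run the forward pass in half precision where it is safe to do so.
    with torch.cuda.amp.autocast(enabled=(device == "cuda"), dtype=torch.float16):
        loss = criterion(model(x), y)

    # Scale the loss so small Float16 gradients do not underflow to zero.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

The loss scaling step exists precisely because Float16's narrow exponent range lets small gradients vanish, which is the weakness BFloat16 was designed to avoid.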

BFloat16, proposed by Google, was also designed to improve the efficiency of deep learning. Unlike the conventional Float16, BFloat16 keeps the same 8 exponent bits as Float32, so its dynamic range matches Float32's, though it sacrifices fraction precision (7 bits instead of Float16's 10). This gives it an advantage over Float16 in handling extreme numerical situations in deep learning, such as exploding or vanishing gradients.

Moreover, in practice this loss of fraction precision is small enough that it rarely affects model convergence or final accuracy.
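
A small PyTorch sketch makes the range difference visible (the value 70000.0 is simply an arbitrary number just above Float16's maximum of 65504):

```python
import torch

# Float16 tops out at 65504, while BFloat16 keeps Float32's exponent range.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38
print(torch.finfo(torch.float32).max)   # ~3.40e+38

x = torch.tensor(70000.0)
print(x.to(torch.float16))   # tensor(inf, dtype=torch.float16) -> overflow
print(x.to(torch.bfloat16))  # tensor(70144., dtype=torch.bfloat16) -> in range, coarser precision
```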

Finally, let's talk about Float8 (FP8), an emerging precision representation that is particularly suitable for deep learning inference under tight resource constraints. Similar in spirit to BFloat16, it trades fraction bits for exponent range: the common E5M2 layout keeps the same five exponent bits as Float16, so its dynamic range is comparable, while the E4M3 layout keeps one more fraction bit at the cost of range. Either way, numerical stability requires extra attention in inference scenarios.
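
As a rough sketch, recent PyTorch builds (2.1 and later) expose the experimental `torch.float8_e4m3fn` and `torch.float8_e5m2` dtypes; operator support is still limited, so the example below only inspects their ranges and round-trips values through FP8:

```python
import torch

# Two common FP8 layouts: E4M3 (more fraction bits) and E5M2 (more exponent range).
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny)

# Round-trip a tensor through FP8 to see the quantization error.
x = torch.randn(4)
x_fp8 = x.to(torch.float8_e5m2)
print(x)                        # original Float32 values
print(x_fp8.to(torch.float32))  # values snapped to the nearest FP8-representable number
```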

