Last Updated on 2025-07-01 by Clay
I previously attempted to implement LayerNorm while reading through model architecture source code ([Machine Learning] Note of LayerNorm). However, that implementation merely followed the formula mechanically. Recently, while revisiting architectural design, I developed a deeper understanding of LayerNorm, and thus recorded my thoughts here.
LayerNorm (Layer Normalization) is primarily designed to regulate the range of hidden states and stabilize the training process of neural networks.
Simply put, LayerNorm normalizes the input vector at each layer, stabilizing its distribution. This technique can even improve convergence speed and enhance model stability during training.
In Transformer architectures, the hidden states output by each layer undergo repeated linear transformations and nonlinear activations, causing them to either grow or shrink. If the output values become too large or too small, it can lead to gradient explosion or vanishing.
More specifically, the role of LayerNorm is to adjust the hidden states to have a mean of 0 and a standard deviation of 1, thus preventing gradient instability. The reason this accelerates training is that keeping hidden states within a stable range makes it easier for optimizers (such as Adam or SGD) to locate minima.
For a set of hidden state vectors $x = (x_1, x_2, \dots, x_d)$:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

- $\mu$: Mean of the layer
- $\sigma^2$: Variance of the layer
- $\epsilon$: A very small constant to prevent division by zero
To preserve model expressiveness, LayerNorm introduces learnable parameters $\gamma$ and $\beta$:

$$y_i = \gamma \hat{x}_i + \beta$$

- $\gamma$: Learnable scale parameter
- $\beta$: Learnable shift parameter
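The two steps above (normalize, then scale and shift) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from any particular framework; the names `layer_norm`, `gamma`, and `beta` are my own choices:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last dimension of x to mean 0 / std 1,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=-1, keepdims=True)       # per-sample mean
    var = x.var(axis=-1, keepdims=True)       # per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# Along the last axis, y now has mean ~0 and std ~1.
```

Note that the statistics are computed over the feature dimension of each sample independently, which is exactly why LayerNorm does not depend on the batch size.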
Additional Notes
- LayerNorm is often compared with another normalization technique, BatchNorm. However, BatchNorm is more commonly used in CV models and computes its statistics across the batch dimension, so its estimates may fluctuate when the batch size is small.
- LayerNorm is more frequently applied in NLP and sequential models because it does not rely on inter-sample statistics, ensuring stable performance during inference.
- LayerNorm is not as efficient in terms of GPU memory access, since it performs normalization for each individual sample.
- RMSNorm is a variant of LayerNorm that normalizes by the Root Mean Square (RMS) of the activations instead of subtracting the mean. It retains the learnable scale parameter $\gamma$ but drops the shift parameter $\beta$, and is commonly adopted in LLMs for its computational efficiency while achieving performance comparable to LayerNorm.
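For comparison with the LayerNorm sketch above, RMSNorm can be written like this (again a minimal illustration under my own naming, not code from any specific LLM implementation):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: skip mean subtraction and divide by the root mean
    square of the activations; only a scale parameter, no shift."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

x = np.array([[3.0, 4.0]])
y = rms_norm(x, gamma=np.ones(2))
```

Dropping the mean computation and the shift parameter saves one pass over the activations and a bit of memory, which is where the efficiency gain over standard LayerNorm comes from.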