Last Updated on 2024-07-25 by Clay
Introduction
Paper link: https://arxiv.org/abs/2212.13345
The author of this paper is Geoffrey Hinton, one of the most renowned figures in deep learning. He was a researcher at Google Brain when he wrote it (he left Google in 2023).
As is widely known, Hinton is credited with popularizing the backpropagation algorithm that underpins modern deep learning. This paper formally introduces a new learning procedure for neural networks and demonstrates that it works well enough on a few small problems to warrant further investigation.
This new learning approach is called the Forward-Forward algorithm, in deliberate contrast to the forward-backward procedure of standard backpropagation. In backpropagation, a single forward pass computes the model’s output, which is compared with the ground truth to produce a loss at the output layer; the gradients of that loss are then computed by differentiation and propagated backward to update all of the model’s weights.
On the other hand, the Forward-Forward algorithm (FF) replaces the traditional Forward-Backward approach by employing two forward passes. The first forward pass uses positive data, while the second forward pass uses synthetically generated negative data.
In the FF model, each layer of the neural network has its own objective function based on a quantity called “goodness,” which in the paper is taken to be the sum of the squared activations of the neurons in that layer.
The aim is to maximize the goodness for positive data and minimize it for negative data. In other words, the goal is to make the model’s neurons more active when presented with positive data.
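As a minimal sketch of this idea (my own PyTorch illustration, not code from the paper):

```python
import torch

def goodness(h: torch.Tensor) -> torch.Tensor:
    """Goodness of one layer: the sum of squared activations per sample."""
    return h.pow(2).sum(dim=1)

# The per-layer objective of the Forward-Forward algorithm is to push
# goodness(h) up for positive data and down for negative data.
```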
Reflections on Back-propagation
This paper raises questions about the conventional backpropagation method in deep learning. Despite a decade of success training heavily parameterized models with gradient descent, there is no convincing evidence that the human brain actually performs anything like “backpropagation.”
Therefore, as a model of how the human cortex learns, back-propagation appears to be imperfect. (Here, we temporarily set aside the question of whether AI needs to mimic humans “completely.”)
Traditional back-propagation also has its limitations. For instance, it requires complete knowledge of the computations performed in the forward pass in order to compute the correct derivatives; if there are black boxes or regions of uncertainty in the forward pass, back-propagation cannot be applied.
Black boxes, however, pose no problem for the Forward-Forward algorithm, because it never needs to differentiate through the forward computation.
When a perfect model of the forward pass is unavailable, various reinforcement learning methods can be used instead. These typically perturb weights or neural activities at random and correlate those perturbations with changes in an evaluation function.
However, reinforcement learning of this kind suffers from high variance: when many variables are perturbed simultaneously, it is difficult to observe the effect of any single change.
The advantages and disadvantages of the Forward-Forward Algorithm (FF) are as follows:
Advantages
- Training is possible even when the details of the forward pass are unknown.
- Learning can proceed while sequential data streams through the neural network, without storing neural activities or pausing to propagate error derivatives.
Disadvantages
- Speed comparable to, or somewhat slower than, backpropagation.
- On the problems studied in this paper, it generalizes somewhat worse than backpropagation.
Details of the FF Model
Evaluation Method: Goodness Calculation
The Forward-Forward Algorithm is a greedy multi-layer learning process inspired by:
- Boltzmann machines (Hinton and Sejnowski, 1986)
- Noise Contrastive Estimation (Gutmann and Hyvärinen, 2010)
In the FF algorithm, the positive pass operates on real data and adjusts the weights to increase the goodness of every hidden layer. The negative pass (still a forward pass, not backpropagation) operates on negative data and adjusts the weights to decrease the goodness of every hidden layer.
The paper explores two goodness measures: the sum of squared neural activities and the negative sum of squared activities. It also notes that only a few simple tasks, such as MNIST and CIFAR-10, have been tested so far, so many other measures remain possible and the current choice is not necessarily optimal.
Assume the goodness function of a layer is simply the sum of the squares of the rectified linear neuron activities in that layer. The learning objective is then to make the goodness of real data exceed some threshold θ, while the goodness of negative data falls below it.
Concretely, the probability that an input vector is positive (i.e., real) is obtained by applying the logistic function σ to the goodness minus the threshold: p(positive) = σ(Σ_j y_j² − θ), where y_j is the activity of hidden unit j.
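A minimal PyTorch sketch of one such layer (my own illustration, not the paper’s code; the learning rate and the threshold value of 2.0 are assumed hyperparameters):

```python
import torch
import torch.nn.functional as F

class FFLayer(torch.nn.Module):
    """A single Forward-Forward layer that trains itself with a local objective."""

    def __init__(self, in_dim: int, out_dim: int, threshold: float = 2.0):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)
        self.threshold = threshold  # theta; 2.0 is an assumed value
        self.opt = torch.optim.Adam(self.parameters(), lr=1e-3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.linear(x))

    def train_step(self, x_pos: torch.Tensor, x_neg: torch.Tensor) -> float:
        # Goodness = sum of squared ReLU activities per sample.
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        # Push sigma(g - theta) toward 1 for positive data and toward 0
        # for negative data; softplus(-z) equals -log(sigmoid(z)).
        loss = F.softplus(-(g_pos - self.threshold)).mean() \
             + F.softplus(g_neg - self.threshold).mean()
        self.opt.zero_grad()
        loss.backward()  # inputs are assumed detached, so the gradient stays within this layer
        self.opt.step()
        return loss.item()
```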
Normalization in Each Layer
In each hidden layer of the FF model, we sum the squared neuron activations, derive a gradient from how far this sum is from its target, and then update that layer’s weights.
The target depends on the nature of the data: for positive (real) data we want high neuron activations, while for negative (synthetic, incorrectly labeled) data we want low activations.
So the FF model does not eliminate gradient computation altogether. Instead, it removes the process of backpropagating the result from the output layer through every layer of the network, and replaces it with local weight adjustments in each layer during forward propagation.
However, this leads to another issue: if the activity values are passed directly from the first hidden layer to the second, the second hidden layer can tell whether the data is positive or negative simply from the length of the first layer’s activity vector, and may therefore fail to learn the important features in the data.
To prevent this, we normalize the length of the previous layer’s activity vector before passing it on. This removes the magnitude information that encodes goodness, so the second hidden layer can no longer directly read off whether the data is positive or negative; it is forced to learn from the relative activities of the neurons, i.e., the orientation of the activity vector.
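A minimal sketch of this normalization:

```python
import torch

def normalize_activity(h: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Divide each sample's activity vector by its L2 norm: the length
    # (which encodes goodness) is hidden from the next layer, and only
    # the orientation, i.e. the relative activities, is passed on.
    return h / (h.norm(p=2, dim=1, keepdim=True) + eps)
```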
Note: a personal thought here is that this reasoning assumes the ReLU activation function, which zeroes out negative values while leaving positive values unchanged. It is unclear whether normalization is still necessary with activation functions like sigmoid or tanh; on the other hand, their squashing of positive values might also affect the results, so this would need to be tested experimentally.
MNIST Experiment
In the paper, most of the experiments are based on the well-known handwritten-digit dataset, MNIST. A lightweight and efficient CNN typically achieves a test error rate of about 0.6% on MNIST. The FF model typically achieves around 1.4%, which various standard tricks such as dropout or label smoothing can reduce to approximately 1.1%.
Furthermore, the combination of unsupervised learning with supervised learning can further enhance the performance of the FF model.
Negative Data Synthesis
To encourage the FF model to learn the long-range correlations that characterize shapes in images, we need to create synthetic images whose local statistics look plausible but whose global structure is wrong; various forms of image corruption can serve this purpose.
One approach the paper mentions is to synthesize negative data by blending two different images with a mask, where the mask is created by repeatedly blurring a random bit image with a filter.
This type of data can be utilized for unsupervised learning without labels.
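A sketch of this hybrid-image scheme (my reading of the paper’s description; the number of blur steps and the 0.5 threshold are assumptions):

```python
import torch
import torch.nn.functional as F

def make_hybrid(img_a: torch.Tensor, img_b: torch.Tensor,
                blur_steps: int = 10) -> torch.Tensor:
    # img_a, img_b: (H, W) tensors in [0, 1]. The hybrid has plausible
    # local statistics but inconsistent long-range structure.
    h, w = img_a.shape
    mask = (torch.rand(1, 1, h, w) > 0.5).float()   # random bit image
    kernel = torch.tensor([[[[0.25, 0.5, 0.25]]]])  # [1/4, 1/2, 1/4] filter
    for _ in range(blur_steps):
        mask = F.conv2d(mask, kernel, padding=(0, 1))                  # horizontal blur
        mask = F.conv2d(mask, kernel.transpose(2, 3), padding=(1, 0))  # vertical blur
    mask = (mask.squeeze() > 0.5).float()  # threshold into large regions
    return img_a * mask + img_b * (1.0 - mask)
```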
Supervised Learning
If we are primarily interested in a specific task, it’s natural to employ supervised learning to accomplish that task. In the case of the MNIST dataset, we can use the original images as positive data and use images with incorrect labels as negative data.
To incorporate labels into the images, consider the example of MNIST. Each label can be converted into a one-hot encoding format. For instance, the digit 1 can be represented as:
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
This vector (optionally softened with label smoothing) then replaces the first 10 pixels of the image. In other words, the only difference between positive and negative data is the embedded label.
When feeding in data, we can pass positive and negative examples through the network at the same time.
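A sketch of this label embedding (my own helper function, following the description above):

```python
import torch
import torch.nn.functional as F

def overlay_label(images: torch.Tensor, labels: torch.Tensor,
                  num_classes: int = 10) -> torch.Tensor:
    # images: (batch, 784) flattened MNIST images; labels: (batch,).
    # The first `num_classes` pixels are overwritten with the one-hot label.
    x = images.clone()
    x[:, :num_classes] = F.one_hot(labels, num_classes).float()
    return x

# Positive data: images overlaid with their true labels.
# Negative data: the same images overlaid with randomly chosen wrong labels.
```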
Finally, during prediction, each of the 10 candidate labels is embedded into the image in turn, and the goodness summed over all hidden layers except the first is computed for each candidate; the label with the highest total goodness is chosen as the predicted answer.
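A sketch of this inference procedure, assuming the overlay_label and normalize_activity helpers above and a list of trained FF layers:

```python
import torch

@torch.no_grad()
def predict(images: torch.Tensor, layers, num_classes: int = 10) -> torch.Tensor:
    scores = []
    for label in range(num_classes):
        # Embed one candidate label into every image.
        candidate = torch.full((images.size(0),), label, dtype=torch.long)
        h = overlay_label(images, candidate)
        total = torch.zeros(images.size(0))
        for i, layer in enumerate(layers):
            h = layer(h)
            if i > 0:  # accumulate goodness of all hidden layers except the first
                total += h.pow(2).sum(dim=1)
            h = normalize_activity(h)
        scores.append(total)
    # Pick the label whose embedding yields the highest total goodness.
    return torch.stack(scores, dim=1).argmax(dim=1)
```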
Paper Experiment Replication
Out of interest in the Forward-Forward algorithm, I attempted to replicate the model described in the paper and trained it on the two datasets used there.
Here is the GitHub link to the FF Model that I replicated: https://github.com/ccs96307/forward-forward-algorithm-pytorch-implemented
My replication achieved a score of 0.96 on MNIST and 0.67 on CIFAR-10.
In addition to these experiments, I also conducted sentiment analysis on the IMDB dataset, since the paper focuses mainly on computer vision tasks while I personally work more in Natural Language Processing (NLP).
On IMDB, I achieved an accuracy of 0.92.
More experiments are still in progress.
References
- The Forward-Forward Algorithm: Some Preliminary Investigations: https://arxiv.org/abs/2212.13345
- Forward-Forward Algorithm – Medium
- Using the Forward-Forward Algorithm for Image Classification