Last Updated on 2024-10-22 by Clay
Today, while I was eating, I came across a video (linked at the end of this article). Unlike many tech channels that jump straight into discussing AI, economics, and whether AI will replace humans, this video took a more careful approach: it explained in detail how hardware characteristics have shaped algorithms (or AI model architectures) over time.
His View: Transformers May Very Well Be Replaced
The following can be considered my notes from the interview, mixed with some of my personal understanding. If there are any mistakes, please feel free to point them out. Thanks!
When computer architecture was first being established, the Von Neumann architecture set out the idea of separating storage, memory, and the central processing unit (CPU): when a program needs to run, it is loaded into memory and then executed by the CPU.
However, now that we're entering the AI era, we typically use GPUs to store and run AI models. What's interesting is that most of the time is actually spent transferring data from GPU HBM to GPU SRAM, rather than on the computation itself.
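To get a feel for this claim, here is a rough back-of-envelope sketch. The model size, HBM bandwidth, and FLOP throughput below are my own illustrative assumptions (roughly modern-datacenter-GPU numbers), not figures from the interview; the point is only that, for single-token decoding, streaming the weights out of HBM takes far longer than the matrix math.

```python
# Rough, assumption-heavy estimate of why single-token decoding is memory-bound.
# All numbers below are illustrative assumptions, not measured values.

params = 7e9                 # assumed model size: 7B parameters
bytes_per_param = 2          # FP16
weight_bytes = params * bytes_per_param

hbm_bandwidth = 3e12         # assumed ~3 TB/s of HBM bandwidth
compute_rate = 1e15          # assumed ~1000 TFLOP/s of FP16 throughput

# Decoding one token (batch size 1) reads every weight once
# and performs roughly 2 FLOPs per parameter (one multiply, one add).
transfer_time = weight_bytes / hbm_bandwidth       # seconds spent moving weights
compute_time = 2 * params / compute_rate           # seconds spent on the math

print(f"weight transfer: {transfer_time * 1e3:.2f} ms")    # ~4.7 ms
print(f"computation:     {compute_time * 1e3:.3f} ms")     # ~0.014 ms
print(f"transfer / compute ratio: {transfer_time / compute_time:.0f}x")
```

Under these assumptions the data movement is a few hundred times slower than the arithmetic, which is exactly the imbalance the interview is pointing at.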
This is why FlashAttention improves the apparent computational efficiency of models: in reality, FlashAttention reduces the number of reads and writes between HBM and SRAM during the attention computation (the large intermediate attention matrix is never fully written out to HBM), rather than speeding up the arithmetic itself.
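As a very rough illustration of the idea (my own NumPy simplification, not FlashAttention's actual CUDA implementation), the sketch below computes attention block by block with an online softmax, so the full N×N score matrix never has to be materialized; on real hardware, only small tiles would need to live in SRAM at any moment.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (N x N) score matrix: lots of memory traffic on real hardware.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def tiled_attention(Q, K, V, block=32):
    # FlashAttention-style idea: process K/V in blocks with an online softmax,
    # so only a small tile of scores exists at any time (it would stay in SRAM).
    N, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)        # running max of scores per query
    row_sum = np.zeros(N)                # running softmax denominator per query
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                        # (N, block) tile of scores
        new_max = np.maximum(row_max, s.max(axis=-1))
        correction = np.exp(row_max - new_max)           # rescale earlier partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V), atol=1e-6)
```

Both functions compute exactly the same result; the only difference is how much intermediate data has to exist at once, which is where the real-world speedup comes from.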
So, as we enter the age of real AI applications (and I feel we are on the verge of it), 'model weight storage' and 'matrix computation' may no longer be handled by separate hardware units. In other words, we are likely heading toward an integrated 'Compute-in-Memory' architecture, where the memory that stores the model is also the unit that performs the computation, saving the time spent moving data between different pieces of hardware.
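My mental model of Compute-in-Memory (a simplified assumption on my part, loosely based on how analog crossbar accelerators are usually described, not on any specific product): the weight matrix is stored in place as cell conductances, the input vector is applied as voltages along the rows, and the matrix-vector product appears as summed currents on the columns, so the weights never travel to a separate compute unit. The `CrossbarArray` class below is purely hypothetical, just a functional toy of that behavior.

```python
import numpy as np

# Toy functional model of an analog compute-in-memory crossbar (an illustrative
# simplification, not any real device): weights are programmed into the array once
# and never leave it; each matrix-vector product is done "in place".

class CrossbarArray:
    def __init__(self, weights, noise_std=0.01, seed=0):
        # Programming step: weights are written into the array once (stored in place).
        self.conductances = np.asarray(weights, dtype=np.float64)
        self.noise_std = noise_std              # analog computation is inherently noisy
        self.rng = np.random.default_rng(seed)

    def matvec(self, x):
        # Input voltages drive the rows; column currents accumulate the products,
        # so the multiply-accumulate happens where the weights already live.
        ideal = self.conductances @ x
        noise = self.rng.normal(0.0, self.noise_std, size=ideal.shape)
        return ideal * (1.0 + noise)            # simple relative-noise model

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 128)) * 0.1
xbar = CrossbarArray(W)

x = rng.standard_normal(128)
y_analog = xbar.matvec(x)
y_digital = W @ x
print("relative error:", np.linalg.norm(y_analog - y_digital) / np.linalg.norm(y_digital))
```

The trade-off this toy hints at is real: you eliminate weight movement entirely, but you accept some analog noise and a costly "reprogramming" step whenever the weights change.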
This idea was proposed long ago, but it wasn't until recently that I fully understood its enormous value. I'm a bit ashamed to have realized it only now.
Alright, back to the main question: will Transformers be replaced? The interviewee believes that we have always designed algorithms to fit the hardware in order to improve efficiency. So once 'Compute-in-Memory' hardware matures, it is likely that a new architecture better suited to it will replace Transformers; that, he argues, is the trend of the future.
In my view, when he says Transformers will be replaced, perhaps he means that auto-regressive models might be replaced? What kind of models will we see then?
I've been pondering this for a long time, and I'm not sure what kind of model architecture would suit this new 'Compute-in-Memory' paradigm, mainly because I'm not that familiar with what 'Compute-in-Memory' can do or its limitations. But one thing is certain: any computer program based on matrix operations will surely thrive alongside it.
What I've been thinking more about is: which aspects of current model architectures are compromises made for computational efficiency? Not long ago, I read a paper pointing out that the causal attention we use today is not strictly necessary, and that an Encoder architecture can be used for autoregressive tasks (my notes are here: [Paper Reading] ENTP: ENCODER-ONLY NEXT TOKEN PREDICTION).
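To make the contrast concrete, here is a tiny sketch (my own simplification of the ENTP idea, not the paper's code or architecture): a decoder applies one causal mask over the whole sequence, while an encoder-only next-token predictor re-encodes each prefix with full bidirectional attention and reads off the last position, trading extra compute for removing the causal constraint.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(H, mask=None):
    scores = H @ H.T / np.sqrt(H.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block disallowed positions
    return softmax(scores) @ H

def encode(X, mask=None, layers=2):
    # Toy stack of residual attention layers (no MLP / layernorm, just to show masking).
    H = X
    for _ in range(layers):
        H = H + attention(H, mask=mask)
    return H

rng = np.random.default_rng(2)
N, d = 6, 8
X = rng.standard_normal((N, d))                 # stand-in for token embeddings

# Decoder style: one pass over the whole sequence with a causal mask,
# so position t never sees positions > t at any layer.
causal_mask = np.tril(np.ones((N, N), dtype=bool))
decoder_out = encode(X, mask=causal_mask)

# ENTP-style (simplified): re-encode each prefix with full bidirectional attention
# and take the last position. Still only uses the prefix, but costs more compute.
entp_out = np.stack([encode(X[:t + 1])[-1] for t in range(N)])

# Both respect the prefix, yet produce different representations (from layer 2 onward).
print(np.abs(decoder_out - entp_out).max())
```

Both variants only ever look at the prefix when predicting the next token; the causal mask is what lets the decoder do it in a single pass, which is precisely the kind of efficiency-driven compromise I'm wondering about.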
Of course, I'm not saying that the Transformer architecture is here to stay forever; in fact, I tend to think that many current Transformer models will evolve in some way. We may replace certain components or integrate Transformer elements into other model architectures.
But overall, compared with my understanding of AI training and inference, I haven't dug deeply enough into hardware. Perhaps I should take the time to properly understand hot topics like quantum computing, photonic computing, Compute-in-Memory, brain-computer interfaces, and more.
References
- AI Models Can’t Solve Nvidia’s Dilemma, A New Paradigm Will Emerge: Interview with Anker Innovations CEO Yang Meng
- Preserving Your Entire Life Through AI Models – Would You Accept It? Interview with Anker Innovations CEO Yang Meng | Big Tech Talks Episode 11