
"Common sense, as people call it, is merely the biases learned during youth"—the training data for AI models is no different

Last Updated on 2024-10-06 by Clay

This year, for work, I tried annotating training data myself; it was only after diving in personally that I truly understood just how profoundly training data affects an AI model.

My goal is for the AI model to respond to customer queries, answering based on relevant documents retrieved from the search system; in essence, a standard RAG-based LLM workflow. However, since the QA tasks the AI needs to handle involve extensive and diverse data, getting the model to provide accurate answers is harder than it may seem.
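
For readers who haven't built one, the sketch below shows the general shape of such a RAG flow: retrieve a few relevant passages, pack them into a prompt, and ask the model to answer from them. Everything in it is illustrative; the toy document store, the word-overlap retriever, and the `llm_generate` placeholder are stand-ins, not the actual system described here.

```python
# Minimal sketch of a RAG-style question-answering flow.
# All names and data here are illustrative placeholders.
from collections import Counter

# Toy document store standing in for the real search system's index.
DOCUMENTS = [
    "Orders can be cancelled within 24 hours of purchase.",
    "Refunds are processed within 5 business days after approval.",
    "Customer support is available Monday to Friday, 9am to 6pm.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query (placeholder for a real retriever)."""
    query_words = Counter(query.lower().split())
    def overlap(doc: str) -> int:
        return sum((Counter(doc.lower().split()) & query_words).values())
    return sorted(DOCUMENTS, key=overlap, reverse=True)[:top_k]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Assemble the retrieved passages and the user question into a single prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer the question using only the documents below.\n"
        f"Documents:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )

def llm_generate(prompt: str) -> str:
    """Placeholder: in practice this calls the fine-tuned LLM that serves the answers."""
    return "(model output would appear here)"

def answer(query: str) -> str:
    return llm_generate(build_prompt(query, retrieve(query)))

print(answer("How long do refunds take?"))
```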

Starting with no training data at all, gradually annotating 100, then 1,000, and now a total of 10,000 records... I've witnessed the model progress through the following three stages:

  1. "Illusory answers it thinks are correct"
  2. "Rigid explanatory answers that are less prone to errors"
  3. "Flexible enough to explain retrieved data better than my own intuitive understanding at times"

These represent three stages of progress.

In the process of the model gradually acquiring these abilities, I tried SFT, DPO, SimPO, ORPO, KTO... and various other training methods. I wouldn't say I "infused knowledge" into the model's weights; rather, the model ultimately learned capabilities such as "correctly interpreting documents" and "understanding Chinese" (most pretrained models perform better in English, which can be considered their native language).

Interestingly, regardless of whether I chose Mistral, Llama, or Gemma architectures, and regardless of whether I used SFT, DPO, SimPO, ORPO, or KTO training methods, as long as I trained with data I carefully selected, the model would eventually converge to a stage that I deemed "acceptable" during my actual tests.
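
To make one of these methods concrete, here is a minimal sketch of the DPO objective written directly in PyTorch. It is not the training code I actually used, only the core loss on a batch of (chosen, rejected) preference pairs; the tensor names are illustrative.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log-prob of the chosen response under the policy, shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log-prob of the rejected response under the policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards: how far the policy has moved from the reference model on each response.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Push the chosen response to be preferred over the rejected one.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()

# Example call with random numbers, just to show the expected shapes.
fake_logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*fake_logps))
```

Methods like SimPO, ORPO, and KTO modify this objective in various ways (SimPO, for instance, drops the reference model), but they all learn from human preference annotations, which is one more reason the quality of the data dominates the outcome.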

So, throughout this year of training models, I've come to realize that the most important factor is indeed the "training data." The saying "garbage in, garbage out" couldn't be more accurate. In the training process, the model essentially compresses and stores ways to respond to similar scenarios within the neural network's weights.

This reminds me of two things:

When I first joined the company, my then-supervisor liked to explain why we used AI in our solution by saying that "AI models have a certain level of extrapolation capability, making them more flexible than rule-based designs." It may have been a way to convince clients, but in hindsight, I believe that even though AI models can handle "similar scenarios" to a certain degree, their ability to "extrapolate" may not be as strong as we think.

A friend of mine did research on Zero-Shot Learning during his master's program. His challenge was to have a model learn a task in domain X with feature A, then a task in domain Y with feature B, and finally perform well on a "task in domain Z with both A and B features."

This aligns with our expectations of AI's "extrapolation" capabilities. However, my friend faced many difficulties during those two years, as completing this task and having the AI model truly "understand" the underlying rules require a certain level of training data and resources, much like OpenAI's explanation of "emergence" in large language models.

Google DeepMind also once published research in which small-scale models reached 100% accuracy on small tasks. Even there, the validation loss fluctuated significantly in the early stages of training, and reaching 100% required measures that aid generalization as well as sufficient training time.

I can't help but reflect: how exactly do humans "learn" things? In what way do we surpass AI, such that we can manage real-world life at its full complexity and scale? It's a question that countless researchers throughout history have pondered endlessly!

That said, have humans really surpassed AI? Isn't our constructed knowledge system and daily life essentially our "training data"? In daily life, there are many things that provide us with "loss" (also a form of feedback).

For instance, as children, we learned to control our bodies through trial and error, feeling pain when we fell and gradually figuring out how to avoid getting hurt.

For instance, in learning, we gain awareness of where our understanding is flawed and what needs correction through discussions with teachers and classmates.

We are constantly receiving vast amounts of information and feedback at every moment.

Current models are still very passive, learning only a few common modalities like text and images—and they still need to be driven by humans. What I mean is that current models can't decide on their own what to do; they wait for an "input" and then provide an "output."

This, in my opinion, is why the concept of "agents" has become so popular over the past year: agents make AI models more adaptable in handling different processes. (Note: an agent generally refers to combining different AI models, or linking a model with various prompt instructions, to handle a flow of data. It is a way to generate "different inputs" so that AI models can handle more complex tasks.)
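
A toy version of that loop might look like the sketch below: the model's reply is parsed into an action, a tool produces a result, and the result is fed back to the model as its next input. The `calculator` tool, the scripted `llm` stub, and the `Action:`/`Observation:` format are all illustrative placeholders, not any particular framework.

```python
# Toy agent loop: model output -> tool call -> observation -> next model input.

def calculator(expression: str) -> str:
    """Illustrative tool: evaluate a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))  # restricted eval, fine for this sketch

TOOLS = {"calculator": calculator}

def llm(prompt: str) -> str:
    """Stand-in for a real model; replies are scripted just to drive the loop."""
    if "Observation:" not in prompt:
        return "Action: calculator 37*12"
    return "Final Answer: 444"

def run_agent(task: str, max_steps: int = 5) -> str:
    prompt = f"Task: {task}"
    for _ in range(max_steps):
        reply = llm(prompt)
        if reply.startswith("Final Answer:"):
            return reply.removeprefix("Final Answer:").strip()
        # Parse "Action: <tool> <argument>" and run the named tool.
        _, tool_name, argument = reply.split(maxsplit=2)
        observation = TOOLS[tool_name](argument)
        # The tool's result becomes part of the model's next "input".
        prompt += f"\n{reply}\nObservation: {observation}"
    return "Gave up after too many steps."

print(run_agent("What is 37 * 12?"))
```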

Given these points, I think we still have some way to go before reaching the AGI (Artificial General Intelligence) that people imagine. Of course, measured against the span of human history, we are already remarkably close to that moment. Every time I see a new AI research paper published, I feel excited, because however small the step, the world really is moving toward that day.

One possibility is that we need to let AI simulate various ways to manipulate a body in a sandbox world, then transfer it to the real physical world after accumulating enough training data. Many researchers are working on this, and everyone is eagerly anticipating the day when this method is perfected enough to allow AI to serve humans in the physical world.

Another possibility is that we need AI models to always stay in a "thinking" state, while also being able to proactively propose things they want to do, rather than just waiting passively for human input.

This kind of training data is very difficult to obtain, as we usually can't record our complete mental activities.

Besides the data, challenges abound in energy consumption and hardware design—far from something that can be solved once the software and IT parts are in place.

It seems I've gone off on a tangent, but I think this is a note worth keeping. Since I'll likely be devoted to AI research for many years to come, I wonder what I'll think of this note when I revisit it a few years from now.
