Personal Interpretation of Cogito Trained with Iterated Distillation and Amplification (IDA)

Last Updated on 2025-07-01 by Clay

Cogito V1 is a model I recently came across on Reddit that demonstrated impressive performance. It was also recommended by my colleagues just a day earlier. I decided to try it out on a RAG task I was working on, and the results were quite astonishing — most notably, it refrained from hallucinations when relevant reference materials were retrieved and was able to effectively synthesize information from multiple sources. Among the models I’ve tested, only Gemma-3 gave me a similar experience without requiring fine-tuning.

The blog post introducing the model and its (somewhat briefly explained) training method by the DeepCogito team can be found here: https://www.deepcogito.com/research/cogito-v1-preview

In short, they trained a series of large language models under the name Cogito, adopting a method known as Iterated Distillation and Amplification (IDA), which traces back to a 2018 paper from OpenAI: “Supervising strong learners by amplifying weak experts.”

According to the original paper, the training process works as follows (the paper’s diagram illustrates the IDA training flow clearly):

  1. First, a human worker H receives a question Q, breaks it down into sub-questions Q_1, Q_2, …, Q_n, and obtains answers to each using a supporting model X: X(Q_1), X(Q_2), …, X(Q_n). H then synthesizes these into a unified answer A.
  2. The data accumulated from the previous step is used to train a simulated human worker H’. Given input Q, H’ should learn to break it down into sub-questions Q_1, Q_2, …, Q_n, get the answers from model X: X(Q_1), X(Q_2), …, X(Q_n), and finally compose answer A.
  3. Finally, both Q and the synthesized A are used to train model X directly, allowing it to generate answer A upon receiving question Q.
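The three steps above can be sketched in code. This is purely illustrative: the function names, the toy decomposition, and the stand-in for model X are all mine, not DeepCogito’s or OpenAI’s; the point is only the control flow of one amplification round, whose (Q, A) pairs would then be used to fine-tune H’ and X.

```python
def model_X(sub_question: str) -> str:
    """Stand-in for the base model X answering a single sub-question."""
    return f"answer({sub_question})"

def amplify(question, decompose, model, synthesize) -> str:
    """One amplification round: H (or the learned H') decomposes Q,
    queries model X on each sub-question, and synthesizes the
    sub-answers into a unified answer A."""
    sub_questions = decompose(question)               # Q -> Q_1 ... Q_n
    sub_answers = [model(q) for q in sub_questions]   # X(Q_1) ... X(Q_n)
    return synthesize(sub_answers)                    # final answer A

# Toy decomposition and synthesis standing in for the human worker H:
decompose = lambda q: [f"{q}/part{i}" for i in range(1, 3)]
synthesize = lambda answers: " + ".join(answers)

answer_A = amplify("Q", decompose, model_X, synthesize)
print(answer_A)  # a single answer composed from X's sub-answers
```

In steps 2 and 3, the collected (Q, sub-questions, A) traces become training data: H’ is trained to imitate the decomposition-and-synthesis behavior, and X is trained to map Q directly to A, internalizing the amplified reasoning.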

However, the above procedure doesn’t fully explain how Cogito exhibits such strong reasoning abilities and prompt-based switching. Personally, I suspect the DeepCogito team may have introduced modifications to the training pipeline (though we’ll need to wait for their paper or a more detailed release to confirm).

One intuitive hypothesis is that they might collect all sub-questions and their respective answers, and use something like a “Disable/Enable deep thinking subroutine” mechanism to determine whether to engage model X in deeper reasoning during training.
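To make this hypothesis concrete, here is a minimal sketch of prompt-based switching between a direct (distilled) path and an amplified deep-thinking path. Everything here is an assumption of mine for illustration: the toggle phrase, the routing function, and the toy answer strings are not DeepCogito’s actual implementation.

```python
# Hypothetical toggle phrase (assumed, not confirmed by DeepCogito):
DEEP_TOGGLE = "Enable deep thinking subroutine."

def respond(system_prompt: str, question: str) -> str:
    """Route the query: deep-thinking (amplified) path only when the
    toggle phrase appears in the system prompt; otherwise answer
    directly, as a distilled model would."""
    if DEEP_TOGGLE in system_prompt:
        # Amplified path: decompose, answer sub-questions, synthesize.
        subs = [f"{question}?step{i}" for i in (1, 2)]
        return "synthesized(" + ", ".join(f"ans({s})" for s in subs) + ")"
    # Direct path: answer immediately with no intermediate reasoning.
    return f"ans({question})"

print(respond("Enable deep thinking subroutine.", "Q"))
print(respond("", "Q"))
```

During training, pairing each question with both modes (and labeling which mode produced which trace) would let a single model learn to switch behavior based on the prompt alone.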


In brief, the DeepCogito team views IDA as a scalable and efficient alignment strategy toward realizing General Super-Intelligence.

The motivation for adopting such a strategy lies in the limitations of current LLM training methods, which typically cap model performance at the level of the overseer or data sources. To surpass human-level intelligence, a methodology is needed that can transcend these bounds.

This is why I’m particularly eager to see the team publish more concrete details on how their approach deviates from the original IDA formulation.

Moreover, the blog post mentions that IDA is more efficient and scalable than other popular methods like RLHF or standard large-model distillation. For instance, the Cogito model was developed by a small team in just 75 days, and their 70B model reportedly outperforms distilled models such as Llama 3.3 70B and Llama 4 Scout 109B.

This claim challenges a conclusion about fine-tuning smaller models that many large companies reached in the past: that retraining smaller models directly was less effective than distilling them from larger models.

Which approach is truly superior? Personally, I believe it’s still too early to draw firm conclusions, as the application and testing scenarios between the two may differ significantly.

That said, the DeepCogito blog frames IDA as a method to teach models to perform deeper “thinking” (amplification) using extra computation, then internalize this ability through distillation — and repeat this cycle to iteratively enhance the model’s intelligence, eventually breaking through the limits of human supervision and reaching general super-intelligence.


References

  * DeepCogito, “Cogito v1 Preview”: https://www.deepcogito.com/research/cogito-v1-preview
  * Paul Christiano, Buck Shlegeris, Dario Amodei, “Supervising strong learners by amplifying weak experts” (2018), arXiv:1810.08575