
[Paper Reading] s1: Simple test-time scaling

Last Updated on 2025-07-01 by Clay

s1 Core Contributions

Test-Time Scaling has become a popular approach for enhancing LLM performance. The idea is to let the model “think” and organize its thoughts before providing an answer, resulting in improved accuracy.

The s1 paper makes the following contributions:

  • Curated and released an open-source dataset of 1,000 reasoning samples named s1K
  • Proposed Budget Forcing, a decoding-time method to control the amount of computation spent on Test-Time Scaling
    • Forced Halt: when the number of thinking tokens exceeds a predefined limit, the end-of-thinking delimiter is appended to stop the reasoning process immediately
    • Forced Generation: when the model tries to end its reasoning too early, the end-of-thinking delimiter is suppressed and a "Wait" token is appended, compelling the model to continue thinking
  • Fine-tuned Qwen2.5-32B-Instruct via SFT on s1K using 16 H100 GPUs (taking only 26 minutes)
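The two budget-forcing controls above can be sketched as a simple decoding loop. This is only an illustrative sketch, not the paper's implementation: `step` stands in for a single-token decoder, and `END_OF_THINKING` for the model's end-of-thinking delimiter.

```python
END_OF_THINKING = "</think>"  # stand-in for the model's delimiter
WAIT = "Wait"

def budget_forced_reasoning(step, prompt, min_tokens, max_tokens):
    """Generate a reasoning trace whose length is kept within budget."""
    tokens = []
    while len(tokens) < max_tokens:
        tok = step(prompt, tokens)
        if tok == END_OF_THINKING:
            if len(tokens) < min_tokens:
                # Forced Generation: suppress the end-of-thinking
                # delimiter and append "Wait" to keep the model thinking.
                tokens.append(WAIT)
                continue
            break  # minimum budget satisfied; allow the model to stop
        tokens.append(tok)
    # Forced Halt: falling out of the loop at max_tokens cuts the
    # reasoning short; the caller would then append END_OF_THINKING.
    return tokens
```

In a real decoder, suppressing the delimiter would be done at the logits level rather than by string comparison; the control flow is the same.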

Reasoning Dataset Construction (s1K)

The research team initially collected 59K samples, but since their goal was to achieve test-time scaling with minimal resources and the simplest possible approach, they progressively filtered the dataset down until only 1K examples remained.

They used the following three criteria to select data:

  1. Quality: Removed samples with API errors and then filtered out improperly formatted data
  2. Difficulty: Compared responses from Qwen2.5-7B and Qwen2.5-32B against reference answers using Claude-3.5-Sonnet. If both models answered correctly, the sample was deemed too easy. They also measured token count as a proxy for reasoning complexity.
  3. Diversity: Used Claude-3.5-Sonnet and the Mathematics Subject Classification (MSC) system to categorize data and ensure variety
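The three criteria can be sketched as a small filtering pipeline. The sample fields and the `per_domain` cap here are hypothetical simplifications; in the paper, the quality and difficulty signals come from model graders (Qwen2.5 7B/32B, Claude-3.5-Sonnet), and the diversity stage samples across MSC domains with a preference for longer reasoning traces.

```python
def filter_s1k(samples, per_domain=20):
    """Three-stage filter: quality -> difficulty -> diversity."""
    # 1. Quality: drop API errors and malformed entries.
    clean = [s for s in samples
             if not s["api_error"] and s["well_formatted"]]

    # 2. Difficulty: discard questions that both reference models
    #    already answer correctly (too easy to teach reasoning).
    hard = [s for s in clean
            if not (s["solved_by_7b"] and s["solved_by_32b"])]

    # 3. Diversity: group by MSC domain, prefer longer reasoning
    #    traces, and cap how many samples each domain contributes.
    by_domain = {}
    for s in hard:
        by_domain.setdefault(s["domain"], []).append(s)
    selected = []
    for group in by_domain.values():
        group.sort(key=lambda s: s["reasoning_tokens"], reverse=True)
        selected.extend(group[:per_domain])
    return selected
```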

Budget Forcing

The paper explored two methods of Test-Time Scaling: Sequential Scaling and Parallel Scaling.

  • Sequential Scaling: The model generates its reasoning process step-by-step, with each stage depending on the previous one. This linear format aligns with how humans typically think (although arguably not always!), and it’s easier to interpret. However, it’s susceptible to error propagation, especially when the model misinterprets earlier steps.
  • Parallel Scaling: The model generates multiple reasoning paths simultaneously and then merges and filters them. This method is faster and allows exploration of diverse possibilities, but it is more prone to hallucinations and less suitable for tasks requiring long, logical chains like mathematical proofs.
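For contrast with the sequential approach the paper adopts, parallel scaling in its simplest form is majority voting over independent samples (self-consistency). A minimal sketch, where `generate` is a stand-in for any sampler that returns one final answer per reasoning path:

```python
from collections import Counter

def majority_vote(generate, question, n_paths=8):
    """Parallel scaling: sample several independent reasoning paths,
    then merge them by keeping the most common final answer."""
    answers = [generate(question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```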

The s1 paper focuses primarily on sequential scaling, controlled via Budget Forcing. Their experiments, especially on GPQA Diamond, suggest that sequential scaling is better suited to such tasks.


The following image clearly illustrates how the [Wait] token is used to force the model to continue reasoning.


Personally, I’m quite curious about the optimal number of tokens for Budget Forcing. The paper provides experimental results based on AIME24.

It appears that performance gains saturate in the range of 4,096 to 8,192 tokens.
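Locating that saturation point amounts to sweeping the thinking budget and watching the marginal gains shrink. A hedged sketch, where `evaluate` is a hypothetical stand-in for scoring the model on a benchmark at a given budget:

```python
def sweep_budgets(evaluate, budgets=(512, 1024, 2048, 4096, 8192)):
    """Score the model at each thinking budget and report the marginal
    gain between successive budgets; saturation shows up as gains
    shrinking toward zero."""
    scores = {b: evaluate(max_tokens=b) for b in budgets}
    gains = {high: scores[high] - scores[low]
             for low, high in zip(budgets, budgets[1:])}
    return scores, gains
```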


Evaluation Benchmarks

The research team evaluated the model on three different benchmarks:

  • AIME24 (American Invitational Mathematics Examination)
  • MATH500 (Competition-level math questions)
  • GPQA Diamond (PhD-level science questions)

At first glance, s1-32B's scores may not appear impressive. However, it outperformed OpenAI's o1-preview on AIME24 and came close to o1 on MATH500 (arguably the easiest of the three benchmarks, since scores were tightly grouped). On GPQA Diamond, it was comparable to o1-mini.

Compared to the original Qwen2.5-32B-Instruct, the improvements are evident across benchmarks:

  • AIME24: +30 percentage points
  • MATH500: +9 percentage points
  • GPQA Diamond: +10.6 percentage points

This highlights the effectiveness of the s1K dataset in boosting performance on logic-intensive tasks.

Another question I was curious about: how does the curated 1K dataset compare to the original 59K version? The paper includes an ablation study:

As shown, the s1K dataset outperformed all other variants (random selection, diversity-based selection, length-based selection), and even narrowly surpassed the full 59K version on MATH500.

This suggests that comparable results can be achieved with roughly 1/60th of the original data, showing that Test-Time Scaling can be unlocked even with a small, high-quality dataset.


Summary

In my view, the research team demonstrated two key insights: First, a small, clean, high-quality dataset can match the performance of large-scale datasets (though quality and diversity must both be considered). Second, Test-Time Scaling can be learned through such lightweight data.

That said, the second point may also be influenced by the fact that Qwen may have already been exposed to reasoning tasks during pretraining, making it easier for the SFT phase to elicit such abilities using a small dataset.

Next, I plan to read Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs to further investigate this possibility.

On a personal note, I’m curious to see how Gemma-2-9B would perform on the s1K dataset. Would it show similar gains across other benchmarks—or perhaps even regress?


