
Who Said the Scaling Law Has Hit a Ceiling? New Research Shows Tiny Gains Compound into Exponential Growth


Scaling Law Is Not Done Yet

Many in the AI community have argued that the Scaling Law is showing diminishing returns, questioning the rationale of training ever-larger models. Recent observations, however, paint a different picture: even minimal improvements in single-step accuracy can compound to enable exponential task-length growth, which may carry more real-world value than raw benchmark scores suggest.

As the marginal gains from scaling up compute shrink, is it still reasonable for companies to invest heavily in ever-larger models? This debate has intensified across the AI field over the past year.


Long-Horizon Task Performance Outweighs Short-Step Gains

A recent paper titled “The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs” offers a fresh perspective:

The research emphasizes that long-horizon task completion has historically been a weak spot for deep learning. Stunning demos in autonomous driving and image generation exist, but achieving long-term, coherent execution (like a multi-step project or a long video) remains a challenge. Companies increasingly demand AI to handle entire workflows, not just isolated questions. This raises a fundamental question: how do we measure the number of steps an LLM can reliably execute?


Execution vs. Planning

Failures in long tasks are often interpreted as reasoning limitations. While LLMs have made significant progress on complex reasoning benchmarks, some studies (e.g., “The Illusion of Thinking,” arXiv:2506.06941) suggest that models merely give the illusion of reasoning and ultimately fail as task length grows.

The new research argues for decoupling planning from execution:

Even when planning is perfect, execution errors accumulate with longer tasks. Execution, not planning, is the critical bottleneck for long-horizon LLM performance. As LLMs are increasingly deployed in extended reasoning or agent-based tasks, this distinction grows in importance.
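
To make the accumulation concrete, here is a minimal back-of-the-envelope check (our arithmetic, not a figure from the paper): with a constant per-step accuracy p and no self-correction, the probability of n flawless steps is p^n.

```python
# Probability of completing n consecutive steps without error,
# assuming constant per-step accuracy and no self-correction.
for p in (0.90, 0.99, 0.999):
    for n in (10, 100, 1000):
        print(f"p={p}: P({n} flawless steps) = {p ** n:.1%}")
```

Even at 99% per-step accuracy, a 100-step task succeeds only about 37% of the time.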


Key Findings

1. Scaling Still Matters

Although single-step accuracy improvements diminish, tiny gains compound, enabling exponential growth in task length. Larger models succeed in more rounds even when provided explicit knowledge and planning, showing that scaling benefits go beyond memory or search abilities.

2. Self-Conditioning Effect

Long-task failures aren’t due to a steady per-step error rate alone. An LLM’s error rate tends to rise as a task progresses: once its own mistakes enter the context, further mistakes become more likely. This contrasts with humans, who usually improve with practice. The paper attributes this self-conditioning to next-token prediction training and finds that scaling alone cannot resolve it.
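
A toy simulation makes the effect visible (our illustration, not the paper’s code; the accuracy values and the hard drop after a single error are arbitrary assumptions):

```python
import random

def stepwise_accuracy(n_steps, p_clean, p_conditioned, trials=10_000):
    """Average accuracy at each step index across many simulated runs.

    p_clean       -- per-step accuracy while the history is error-free
    p_conditioned -- per-step accuracy once any earlier step was wrong
                     (a toy stand-in for self-conditioning)
    """
    correct_counts = [0] * n_steps
    for _ in range(trials):
        history_has_error = False
        for t in range(n_steps):
            acc = p_conditioned if history_has_error else p_clean
            if random.random() < acc:
                correct_counts[t] += 1
            else:
                history_has_error = True  # the mistake stays in context
    return [c / trials for c in correct_counts]

# With p_conditioned == p_clean the curve stays flat; with a lower
# conditioned accuracy, later steps get measurably worse.
for t, acc in enumerate(stepwise_accuracy(50, 0.99, 0.90)):
    if t % 10 == 0:
        print(f"step {t:2d}: avg accuracy {acc:.3f}")
```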

3. Chain-of-Thought (CoT) Matters

Recent thinking models can correct self-conditioning. Sequential test-time compute boosts the task length achievable in a single round. For instance:

GPT-5’s thinking variant (Horizon) executes over 1,000 steps, far surpassing competitors like Claude-4-Sonnet at 432 steps.

Long-task execution failures should not be mistaken for reasoning deficiencies. Increased model scale and sequential compute allocation significantly improve long-horizon performance. Economic value often correlates with total task length rather than short benchmarks.


Detailed Methodology

Step Accuracy vs. Task Length

The authors analyze the relationship between single-step accuracy and task success, assuming:

  1. Step accuracy (p) is constant.
  2. No self-correction occurs.

Under these assumptions, completing a task of length H requires H consecutive correct steps, which happens with probability p^H. Setting p^H = s and solving for H gives the task length achievable at success rate s:

H_s = ln(s) / ln(p)

For step accuracy above 70%, small gains lead to super-exponential task-length improvements, confirming that diminishing returns on short benchmarks do not imply stagnation in long-horizon tasks.
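
Plugging numbers into this relation makes the compounding vivid (our arithmetic, using s = 0.5):

```python
import math

def horizon(p: float, s: float = 0.5) -> float:
    """Solve p ** H = s for H: the task length reachable at success rate s."""
    return math.log(s) / math.log(p)

for p in (0.90, 0.99, 0.999, 0.9999):
    print(f"step accuracy {p:.2%}: ~{horizon(p):,.0f} steps at 50% success")
```

Each additional “nine” of per-step accuracy multiplies the achievable horizon roughly tenfold (about 7, 69, 693, and 6,931 steps here).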


Isolating Execution by Decoupling Planning

Using a flight-booking agent example:

By providing the complete plan and all required knowledge in context, the study isolates execution as the sole variable; task length is then measured as rounds × steps per round (K), as sketched below.
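
A rough sketch of such an evaluation harness (our reconstruction, not the paper’s code: query_model is a hypothetical stub, and the key-value running-sum task is an assumed stand-in for whatever the paper actually executes):

```python
import random

def evaluate_execution(query_model, n_rounds: int, k: int) -> int:
    """Count how many rounds a model executes correctly before its first error.

    The key-value table (the 'knowledge') and a fixed plan are restated in
    context every round, so neither planning nor knowledge retrieval can be
    the failure mode -- only execution. Task length = rounds * K.
    """
    knowledge = {f"key_{i}": random.randint(1, 9) for i in range(100)}
    running_sum = 0
    for round_idx in range(n_rounds):
        keys = random.sample(sorted(knowledge), k)  # K steps this round
        running_sum += sum(knowledge[key] for key in keys)
        answer = query_model(knowledge, keys)  # model's updated running sum
        if answer != running_sum:
            return round_idx  # first wrong sum ends the task
    return n_rounds
```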


Experimental Insights

Scaling Boosts Long-Horizon Execution

Even when world knowledge and planning are supplied explicitly, larger models execute correctly for more rounds, confirming that scaling improves execution itself rather than just memory or retrieval.

Self-Conditioning Explained

Once a model’s context contains its own earlier mistakes, its per-step accuracy degrades further, and the paper finds that scaling model size alone does not eliminate this effect.

Task Length per Round

GPT-5 Horizon demonstrates a massive lead over other models in single-round task length, and RL-trained thinking models outperform instruction-tuned counterparts.


Conclusion

Long-horizon execution remains a major challenge, especially for open-weight models compared with proprietary API models. Scaling model size and enabling reasoning mechanisms (CoT, sequential test-time compute) substantially extend the practical task length, challenging the assumption that scaling has hit diminishing returns. The study calls for benchmarks focused on execution depth to measure the benefits of LLM scaling accurately.