Stanford Study: The “Clash of Optimizers” — AdamW Wins by Stability

Since Adam's introduction in 2014, it and its improved variant AdamW have dominated the pretraining of open-weight language models. Their stability and fast convergence on massive datasets have made them the default choice for large-scale training.


Why Optimizers Matter in Pretraining

As models grow ever larger, pretraining has become one of the most compute-intensive bottlenecks in AI research, often accounting for the bulk of development cost. In this setting, the choice of optimizer directly affects convergence speed and overall training efficiency.

Researchers have proposed several directions for improvement. Some of the fastest alternatives adopt matrix-based preconditioners (e.g., Muon, Soap, Kron), reporting speedups of 30–40% over a carefully tuned AdamW.
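For reference, here is a minimal sketch of how AdamW is typically configured for pretraining in PyTorch; the model, learning rate, betas, and weight decay below are illustrative placeholders, not settings from the study.

```python
import torch
from torch import nn

# Toy stand-in for a transformer language model; any nn.Module works here.
model = nn.Sequential(nn.Embedding(32_000, 512), nn.Linear(512, 32_000))

# AdamW decouples weight decay from the gradient update, which is the main
# difference from classic Adam with L2 regularization.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # illustrative peak learning rate
    betas=(0.9, 0.95),  # beta2 = 0.95 is a common choice in LLM pretraining
    eps=1e-8,
    weight_decay=0.1,   # decoupled weight decay
)

# One illustrative training step with a placeholder objective.
x = torch.randint(0, 32_000, (8, 128))
logits = model(x)
loss = logits.float().pow(2).mean()  # placeholder loss, not a real LM objective
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```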

But a recent study from Stanford University's Percy Liang group reveals a more nuanced picture: while many optimizers claim 1.4–2× training speedups, AdamW remains the most robust and reliable choice for pretraining. Matrix-based optimizers do show clear advantages, but only under specific data-to-model scaling regimes.


Why Previous Comparisons May Be Misleading

The Stanford team highlights two methodological flaws in prior claims of breakthrough performance:

1. Unfair Hyperparameter Tuning

In many prior comparisons, the proposed optimizer is tuned carefully while the AdamW baseline runs with default or only lightly tuned hyperparameters, which inflates the apparent advantage of the newcomer.

2. Insufficient Scale Testing

Claimed gains are often demonstrated only on small models or short training runs, and results at that scale do not necessarily carry over to larger models and higher data-to-model ratios.


The Study: A Systematic Benchmark

The researchers conducted a large-scale evaluation of 11 optimizers across model sizes from 130M to 1.2B parameters and data-to-model ratios ranging from 1× to 8× the Chinchilla-optimal budget.
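To make the 1×–8× range concrete, the small sketch below converts it into rough token budgets, assuming the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter; the constant is a heuristic, not a figure from the study.

```python
# Rough token budgets for the scales described above, assuming the
# commonly cited Chinchilla heuristic of ~20 tokens per parameter.
TOKENS_PER_PARAM = 20  # approximation of the Chinchilla-optimal ratio

def token_budget(n_params: float, chinchilla_multiple: float) -> float:
    """Training tokens for a model at a given multiple of Chinchilla-optimal."""
    return n_params * TOKENS_PER_PARAM * chinchilla_multiple

for n_params in (130e6, 1.2e9):        # the two ends of the range studied
    for multiple in (1, 2, 4, 8):      # 1x to 8x Chinchilla-optimal
        tokens = token_budget(n_params, multiple)
        print(f"{n_params/1e6:>6.0f}M params, {multiple}x: {tokens/1e9:5.1f}B tokens")
```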

Key Findings

Once the AdamW baseline was tuned with equal care, the 1.4–2× speedups claimed for newer optimizers did not hold up. Matrix-based optimizers such as Muon, Soap, and Kron delivered real but smaller gains, closer to the 30–40% range, and only under specific data-to-model scaling regimes. Across model sizes and data budgets, AdamW remained the most robust and reliable choice.


Methodology in Detail

The evaluation followed a three-phase methodology:

Phase I: Exhaustive Parameter Scans

Each of the 11 optimizers was first tuned with broad hyperparameter sweeps, so that no method was handicapped by a poorly chosen configuration; a sketch of this kind of sweep follows.
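A hedged sketch of what such a scan can look like: a grid over learning rate, weight decay, and warmup for one optimizer, keeping the best configuration. The search space and the `train_and_evaluate` stub are illustrative assumptions, not the study's actual protocol.

```python
import itertools
import random

# Illustrative search space; the grids used in the study differ per optimizer.
SEARCH_SPACE = {
    "lr": [1e-4, 3e-4, 1e-3],
    "weight_decay": [0.0, 0.01, 0.1],
    "warmup_steps": [500, 2000],
}

def train_and_evaluate(optimizer_name: str, config: dict) -> float:
    """Hypothetical stand-in for a full pretraining run; returns a dummy loss."""
    return random.random()  # replace with an actual training run

def sweep(optimizer_name: str) -> tuple[dict, float]:
    """Grid-search one optimizer so every method gets the same tuning budget."""
    best_config, best_loss = None, float("inf")
    keys = list(SEARCH_SPACE)
    for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
        config = dict(zip(keys, values))
        loss = train_and_evaluate(optimizer_name, config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

# Example: tune an AdamW baseline with the same budget as any competitor.
print(sweep("adamw"))
```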

Phase II: Identifying Sensitive Parameters

The sweeps were then used to flag which hyperparameters actually move the final loss, and therefore which ones must be re-tuned as model size and data budget change.
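One simple way to probe sensitivity, sketched here under the same assumptions as above: perturb one hyperparameter at a time around a tuned configuration and record how much the loss moves. The perturbation factors and the `train_and_evaluate` stub are hypothetical.

```python
import random

def train_and_evaluate(optimizer_name: str, config: dict) -> float:
    """Hypothetical stand-in for a full pretraining run; returns a dummy loss."""
    return random.random()  # replace with an actual training run

def sensitivity(optimizer_name: str, best_config: dict, factors=(0.5, 2.0)) -> dict:
    """Perturb one hyperparameter at a time and record how far the loss moves."""
    baseline = train_and_evaluate(optimizer_name, best_config)
    deltas = {}
    for key, value in best_config.items():
        worst = 0.0
        for factor in factors:
            trial = dict(best_config)
            trial[key] = value * factor  # scale a single hyperparameter up or down
            worst = max(worst, abs(train_and_evaluate(optimizer_name, trial) - baseline))
        deltas[key] = worst  # a larger delta marks a more sensitive hyperparameter
    return deltas

# Example: probe sensitivity around an illustrative AdamW optimum.
print(sensitivity("adamw", {"lr": 3e-4, "weight_decay": 0.1, "warmup_steps": 2000}))
```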

Phase III: Case Studies at Scale

Finally, the best configurations were carried to the largest settings in the study, up to 1.2B parameters and 8× the Chinchilla-optimal data budget, to check whether conclusions drawn at small scale still hold.


Conclusion: Stability Wins

The Stanford study offers a sobering takeaway: much of the advertised advantage of newer optimizers fades once the AdamW baseline is tuned fairly and the comparison is run at realistic scale.

For practitioners, the message is clear: fair hyperparameter tuning and large-scale testing matter far more than chasing the latest optimizer hype.