Former OpenAI researcher Diogo Almeida published a detailed analysis on July 4 confirming that OpenAI’s foundational 2020 scaling laws, the Kaplan et al. paper that shaped how the entire industry trained large language models, contained a bug. The error caused labs to build models that were too large and trained on too little data for roughly two years, until DeepMind’s Chinchilla paper corrected course in 2022.
The Three-Part Bug
Almeida, who worked on LLM optimization at OpenAI during the period in question, breaks the bug into three compounding mistakes in the original Kaplan et al. paper.
First, every model in the study was trained on a fixed amount of data: approximately 130 billion tokens regardless of model size. Since scaling laws predict that data requirements should grow with model size, small models received far more training relative to their capacity than large ones. The fixed data ceiling meant larger models never got the training they needed to demonstrate what more data could do.
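The imbalance is easy to see with a little arithmetic. The sketch below divides the fixed budget of roughly 130 billion tokens by a few model sizes; the sizes are hypothetical values chosen for illustration, not figures taken from the Kaplan et al. paper.

```python
FIXED_TOKENS = 130e9  # approximate fixed token budget cited above

# Hypothetical model sizes, chosen only to illustrate the trend.
model_sizes = [1e6, 1e8, 1e9]

def tokens_per_parameter(n_params, n_tokens=FIXED_TOKENS):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

ratios = {n: tokens_per_parameter(n) for n in model_sizes}
for n_params, ratio in ratios.items():
    print(f"{n_params:.0e} params: {ratio:,.0f} tokens per parameter")
```

With a fixed budget, every tenfold increase in model size cuts the tokens seen per parameter by a factor of ten.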
Second, the researchers used a cosine learning rate schedule that decayed to zero at the target token count. This caused performance to plateau artificially near the end of training, making it appear that additional data would not help. In reality, larger models would have continued improving with more data and a different learning rate schedule.
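A minimal sketch of such a schedule shows why the end of training contributes so little. The function below is a generic cosine decay without warmup; the step counts and peak rate are placeholder assumptions, not the paper's actual settings.

```python
import math

def cosine_lr(step, total_steps, peak_lr=1.0, final_lr=0.0):
    """Cosine decay from peak_lr at step 0 to final_lr at total_steps."""
    progress = min(step / total_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine

# Same training step, two different decay horizons (hypothetical values).
short_horizon = cosine_lr(step=90, total_steps=100)
long_horizon = cosine_lr(step=90, total_steps=200)
print(f"horizon 100: {short_horizon:.3f} of peak")
print(f"horizon 200: {long_horizon:.3f} of peak")
```

At step 90 the short-horizon schedule is already below 3 percent of its peak rate, while a schedule planned for twice as many steps is still above half. A run that ends where its schedule hits zero therefore says little about what the same model would do with a longer horizon.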
Third, the paper claimed results were “largely independent of learning rate schedule.” Almeida notes this conclusion was technically accurate given the fixed token budget but did not hold in the data-infinite regime that scaling laws aim to model.
The Industry Consequence
The combined effect was a clear prescription: prioritize model size over data. The original scaling laws concluded that the optimal parameter count grows as compute raised to the power 0.73, making parameters the primary variable to increase. That logic produced GPT-3’s 175-billion-parameter architecture.
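To make the exponent concrete, here is a rough Python sketch of how a tenfold increase in compute would be split under that rule. It assumes compute is proportional to parameters times tokens, a common simplification that is not stated in the article, and it uses an exponent of 0.5 as a stand-in for an even split between the two.

```python
def growth_factors(compute_multiplier, param_exponent):
    """Return (parameter growth, token growth) for a compute increase.

    Assumes compute ~ parameters * tokens, so tokens receive the
    leftover exponent (1 - param_exponent).
    """
    n_growth = compute_multiplier ** param_exponent
    d_growth = compute_multiplier ** (1.0 - param_exponent)
    return n_growth, d_growth

kaplan_n, kaplan_d = growth_factors(10, 0.73)
equal_n, equal_d = growth_factors(10, 0.5)
print(f"exponent 0.73: params x{kaplan_n:.2f}, tokens x{kaplan_d:.2f}")
print(f"exponent 0.50: params x{equal_n:.2f}, tokens x{equal_d:.2f}")
```

Under these assumptions, ten times the compute means about 5.4 times the parameters but under twice the data with the 0.73 exponent, versus roughly 3.2 times each with an even split.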
DeepMind’s Chinchilla paper in 2022 overturned this entirely. Chinchilla, a 70-billion-parameter model trained on 1.4 trillion tokens, outperformed Gopher, a 280-billion-parameter model trained on only 300 billion tokens, using the same compute budget. The corrected scaling law showed that model size and training data should scale in roughly equal proportion, with approximately 20 tokens per parameter being optimal.
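The comparison can be checked with back-of-the-envelope arithmetic. The sketch below assumes the widely used approximation that training costs about 6 FLOPs per parameter per token; that rule of thumb and the helper names are assumptions for illustration, not figures from Almeida's post.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumed rule of thumb: C ~ 6 * N * D

def training_flops(n_params, n_tokens):
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(flops, tokens_per_param=20):
    """Split a FLOP budget so that tokens = tokens_per_param * params."""
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    return n_params, tokens_per_param * n_params

chinchilla = training_flops(70e9, 1.4e12)
gopher = training_flops(280e9, 300e9)
n_opt, d_opt = compute_optimal(gopher)

print(f"Chinchilla: {chinchilla:.2e} FLOPs, {1.4e12 / 70e9:.0f} tokens/param")
print(f"Gopher:     {gopher:.2e} FLOPs, {300e9 / 280e9:.1f} tokens/param")
print(f"Gopher budget at 20 tokens/param: {n_opt:.2e} params, {d_opt:.2e} tokens")
```

Under this approximation the two budgets land within about 20 percent of each other (roughly 5.9e23 versus 5.0e23 FLOPs). Re-splitting Gopher's budget at 20 tokens per parameter gives about 65 billion parameters and 1.3 trillion tokens, close to Chinchilla's actual configuration.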
Sander Dieleman, the DeepMind researcher known for his work on diffusion models, confirmed Almeida’s analysis, calling it “an interesting anecdote about LLMs” and noting the original scaling law “likely caused the industry to waste massive computational resources on a multitude of models that were too large but undertrained.”
Not the Only Bug in the Chain
Almeida’s post was prompted by OpenAI researcher Lilian Weng’s June 24 blog post “Scaling Laws, Carefully,” which surveyed the mainstream explanation for the discrepancy between Kaplan and Chinchilla. That explanation, from Besiroglu et al. (2024), attributed the difference to how total parameters were counted. Almeida calls this follow-up research “unfortunately inaccurate, though not due to any fault of the authors.” The actual cause, he argues, was the bug he describes.
Almeida also notes that Chinchilla’s own fitting method was not clean: a 2024 reanalysis found that the loss scale in the optimizer was set too high and the Huber loss was averaged over samples instead of summed, causing premature fitting termination. “The paper fixes one bug but introduces another,” he writes.
The Cost Estimate Problem
Neither Almeida nor Dieleman put a precise dollar figure on the wasted compute. The claim of “millions of compute hours” circulating in coverage from KuCoin and Chinese tech media is directionally correct but unquantified. Between 2020 and 2022, every major lab training large models was following the Kaplan recipe: stack parameters, use less data. GPT-3, Gopher, and their derivatives all reflect this approach. The aggregate number of GPU hours spent across the industry training oversized models on insufficient data during that window is substantial, but no independent estimate has been published.
Why It Took This Long
Almeida acknowledges he was at OpenAI during this period and missed the bug himself. “The learning rate schedule seemed so obviously an important hyperparameter that it looked intentionally set,” he writes. The bug was eventually discovered but, according to Almeida, “not explicitly acknowledged” by OpenAI. He closes by noting that every major AI lab has known about this for some time and calls for the original paper to be amended with a note.
The disclosure lands at a moment when scaling laws are under renewed scrutiny. Labs are spending billions on training runs, and the question of whether compute is being allocated optimally has direct implications for model economics, agent deployment costs, and infrastructure planning across the industry.