Leaderboards often ignore margins of error. Learn how to use power analysis to find out which AI models actually perform best.

We need to stop treating evals as just a series of contests and start seeing them as statistical experiments drawn from an unseen super-population of questions. When we report a single percentage without error bars, we’re acting like the bucket is the ocean.
https://arxiv.org/html/2411.00640v1


The super-population perspective shifts the view of benchmarks from fixed contests to statistical experiments. Instead of treating a benchmark as a fixed, final set of questions, it views those questions as a small sample drawn from an infinite "ocean" of all possible questions on a subject. The goal of an evaluation is not to see how a model performs on those specific items, but to use the sample to estimate the model's "true" underlying skill across the entire potential population of questions.
Reporting a single number without error bars or confidence intervals ignores the statistical noise inherent in sampling. A model might appear to be "state-of-the-art" simply because it got lucky with a specific bucket of questions, rather than possessing superior underlying skill. Without quantifying precision through standard errors, researchers cannot determine if a small lead (such as 1% or 2%) is a genuine improvement or merely a result of random variation in the question set or the model's non-deterministic nature.
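To make this concrete, here is a minimal sketch of attaching a standard error and 95% confidence interval to a benchmark accuracy, using the normal approximation to the binomial. The model names and score counts are hypothetical:

```python
import math

def accuracy_ci(correct: int, n: int, z: float = 1.96):
    """95% confidence interval for benchmark accuracy, treating the
    n questions as a sample from a larger question population
    (normal approximation to the binomial)."""
    p = correct / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the mean score
    return p, (p - z * se, p + z * se)

# Hypothetical scores: two models on a 500-question benchmark.
p_a, ci_a = accuracy_ci(440, 500)  # model A: 88.0%
p_b, ci_b = accuracy_ci(430, 500)  # model B: 86.0%
print(f"A: {p_a:.1%}  95% CI [{ci_a[0]:.1%}, {ci_a[1]:.1%}]")
print(f"B: {p_b:.1%}  95% CI [{ci_b[0]:.1%}, {ci_b[1]:.1%}]")
# The two intervals overlap substantially, so A's 2-point lead
# could easily be sampling noise rather than superior skill.
```

Here model A's interval is roughly [85.2%, 90.9%] and model B's is roughly [83.0%, 89.0%]: the leaderboard gap disappears inside the error bars.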
Clustering occurs when multiple questions in a benchmark are related to the same source, such as ten different questions based on a single reading passage. If a model fails to understand the core passage, it will likely miss all associated questions, meaning those data points are not independent. If researchers fail to use "clustered standard errors" to account for this correlation, they will significantly overstate the precision of their results, sometimes producing error bars that are three times too narrow.
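The effect of clustering is easy to demonstrate. The sketch below (with made-up data) compares a naive standard error against a cluster-robust one, where residuals are summed within each cluster before being squared so that correlated questions are not counted as independent evidence:

```python
import math
from collections import defaultdict

def naive_se(scores):
    """Standard error assuming every question is independent."""
    n = len(scores)
    p = sum(scores) / n
    return math.sqrt(p * (1 - p) / n)

def clustered_se(scores, clusters):
    """Cluster-robust standard error of the mean: residuals are
    summed within each cluster (e.g. questions sharing a reading
    passage) before squaring."""
    n = len(scores)
    mean = sum(scores) / n
    sums = defaultdict(float)
    for x, c in zip(scores, clusters):
        sums[c] += x - mean
    return math.sqrt(sum(s * s for s in sums.values())) / n

# Hypothetical extreme case: 5 passages x 10 questions each, and the
# model gets every question on a passage right or wrong together.
scores, clusters = [], []
for passage in range(5):
    outcome = 1 if passage < 3 else 0  # perfectly correlated within passage
    scores += [outcome] * 10
    clusters += [passage] * 10

print(f"naive SE:     {naive_se(scores):.3f}")
print(f"clustered SE: {clustered_se(scores, clusters):.3f}")
```

In this worst case the naive standard error is about 0.069 while the clustered one is about 0.219, i.e. the naive error bars are roughly three times too narrow, matching the distortion described above.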
Paired analysis is a method of comparing two models by looking at the difference in their performance on every individual question rather than just comparing their final average scores. Because models often find the same questions "easy" or "hard," their scores are correlated. By focusing on the specific gap between models on a per-question basis, researchers can cancel out the "background noise" of question difficulty, which can reduce the variance of the comparison by as much as a third without requiring any new data.
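A small sketch shows the variance reduction. With hypothetical per-question correctness for two models (both fail the same "hard" items), the standard error of the paired per-question difference is much smaller than the unpaired comparison of means:

```python
import math

def unpaired_se(a, b):
    """SE of the difference in mean scores, ignoring pairing."""
    n = len(a)
    def var(xs):
        m = sum(xs) / n
        return sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var(a) / n + var(b) / n)

def paired_se(a, b):
    """SE of the mean per-question difference: shared question
    difficulty cancels out of each a_i - b_i."""
    n = len(a)
    d = [x - y for x, y in zip(a, b)]
    m = sum(d) / n
    var_d = sum((x - m) ** 2 for x in d) / (n - 1)
    return math.sqrt(var_d / n)

# Hypothetical 12-question benchmark: both models miss the same
# hard questions; model A additionally wins two items outright.
a = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]
b = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
print(f"unpaired SE of gap: {unpaired_se(a, b):.3f}")
print(f"paired SE of gap:   {paired_se(a, b):.3f}")
```

Because the two score vectors are correlated, the paired standard error here comes out roughly half the unpaired one, with no extra questions collected.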
Power analysis is a statistical tool for working out how likely an experiment is to detect a real difference between models if one exists. It lets researchers calculate the minimum number of questions (n) needed to reliably identify a given improvement, known as the Minimum Detectable Effect (MDE). For example, a small benchmark with only 164 questions may have such low power that it is statistically incapable of distinguishing a 5% improvement from noise, rendering the evaluation a "blunt instrument" that produces unreliable rankings.
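The standard formulas are short enough to sketch. Assuming a paired z-test at 5% significance with 80% power, and a hypothetical standard deviation of 0.5 for the per-question score differences (plausible for binary scoring), the required n and the MDE for a given benchmark size are:

```python
import math

Z_ALPHA = 1.96  # two-sided 5% significance
Z_POWER = 0.84  # 80% power

def min_questions(effect: float, sd_diff: float) -> int:
    """Minimum n to detect a true score gap `effect`, given the
    standard deviation `sd_diff` of per-question differences."""
    return math.ceil(((Z_ALPHA + Z_POWER) * sd_diff / effect) ** 2)

def mde(n: int, sd_diff: float) -> float:
    """Minimum detectable effect for a benchmark with n questions."""
    return (Z_ALPHA + Z_POWER) * sd_diff / math.sqrt(n)

print(min_questions(0.05, 0.5))  # -> 784 questions for a 5-point gap
print(f"{mde(164, 0.5):.1%}")    # -> about 10.9% for 164 questions
```

Under these assumptions, a 164-question benchmark can only reliably detect gaps of roughly 11 percentage points; detecting a 5-point gap would take nearly 800 questions.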
