Stop letting tiny leaderboard gains fool you. Learn how to use statistical significance to tell if an AI model is truly better or just lucky.

A number without a margin of error isn't a measurement; it's an opinion. We need to stop being fooled by the decimals and use actual math to separate genuine improvement from luck.
A small lead, such as a 0.5% difference between two models, is often statistically insignificant and may simply be the result of random noise. Because LLMs are probabilistic, their performance can fluctuate based on the specific sample of questions asked. Without calculating confidence intervals or margins of error, it is impossible to tell if Model A is truly superior to Model B or if it simply got "lucky" with a specific set of test questions.
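As a concrete version of that check, here is a minimal sketch in Python: it computes an unpaired, normal-approximation 95% confidence interval for the gap between two accuracies. The counts (Model A at 86.5% vs. Model B at 86.0% on 1,000 questions) are hypothetical, and a paired per-question analysis would give a tighter interval.

```python
import math

def accuracy_gap_ci(correct_a, correct_b, n, z=1.96):
    """95% CI for the accuracy gap between two models scored on the same
    n questions, using an unpaired normal approximation for each proportion."""
    p_a, p_b = correct_a / n, correct_b / n
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    gap = p_a - p_b
    return gap - z * se, gap + z * se

# Hypothetical leaderboard: Model A scores 86.5%, Model B 86.0% on 1,000 questions.
lo, hi = accuracy_gap_ci(865, 860, 1000)
print(f"gap = +0.5%, 95% CI = [{lo:+.1%}, {hi:+.1%}]")  # interval straddles zero
```

Because the interval runs from roughly -2.5% to +3.5%, a gap of zero (or even Model B being better) is entirely consistent with the data: the 0.5% lead tells us nothing.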
Bootstrap resampling is a statistical technique for quantifying uncertainty, especially in small datasets where methods that lean on the Central Limit Theorem break down. It involves creating thousands of simulated test sets by repeatedly sampling questions from the original set with replacement. By observing how the model's score fluctuates across these thousands of variations, researchers get a more honest picture of the model's stability and can tell whether a high score is fragile or robust.
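Here is a minimal percentile-bootstrap sketch, assuming each question's outcome is stored as a 0/1 score; the 50-question tally below is invented for illustration.

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the test set with replacement
    n_boot times and read the CI off the distribution of mean scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

# Toy benchmark: 43 of 50 questions correct, i.e. an impressive-looking 86%.
scores = [1] * 43 + [0] * 7
print(bootstrap_ci(scores))  # roughly (0.76, 0.96): the 86% is fragile
```

On a 50-question benchmark the resampled scores swing by about twenty percentage points, which is exactly the fragility the headline number hides.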
Data is considered "clumpy" when test questions are not truly independent, such as when a benchmark includes twenty variations of the same logic puzzle or multiple questions derived from a single document. If a model is treated as having twenty independent successes for mastering one specific "clump," the margin of error will appear much smaller than it actually is. To fix this, researchers use "Clustered Standard Errors" to ensure they aren't double-counting successes from highly correlated data points.
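One way to compute such a cluster-robust standard error, as a sketch: sum the residuals within each clump before squaring, rather than treating every question as independent. The five-clump toy benchmark below is invented to show how far the naive and clustered estimates can diverge.

```python
import math
from collections import defaultdict

def clustered_se(scores, cluster_ids):
    """Cluster-robust SE of mean accuracy: residuals of correlated
    questions are summed within their cluster before squaring, instead
    of being treated as independent observations."""
    n = len(scores)
    mean = sum(scores) / n
    cluster_resid = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        cluster_resid[c] += s - mean
    g = len(cluster_resid)
    var = sum(r * r for r in cluster_resid.values()) / n ** 2
    return math.sqrt(var * g / (g - 1))  # common small-sample correction

# Toy benchmark: 100 questions in 5 clumps of 20 near-duplicates each;
# the model aces three clumps and fails two, so overall accuracy is 60%.
scores = [1] * 20 + [0] * 20 + [1] * 20 + [0] * 20 + [1] * 20
clusters = [i // 20 for i in range(100)]
naive = math.sqrt(0.6 * 0.4 / 100)
print(f"naive SE: {naive:.3f}, clustered SE: {clustered_se(scores, clusters):.3f}")
# naive SE: 0.049, clustered SE: 0.245 -- five times wider
```

Treating the 100 questions as independent makes the benchmark look five times more precise than it really is, because the model effectively faced only five distinct tests.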
Using a large model to grade a smaller one introduces subjective biases, such as "Verbosity Bias," where the judge favors longer answers regardless of quality, or "Position Bias," where the judge prefers whichever answer appears first in the prompt. There is also "Self-enhancement Bias," where a model might prefer responses that mimic its own training style. To mitigate this, evaluators must use techniques like swapping answer positions, providing "anchor examples" for grading, and validating a portion of the results with human labels.
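A minimal sketch of the position-swap check, where `ask_judge` is a hypothetical stand-in for the judge-model call that returns "first", "second", or "tie":

```python
def judged_winner(ask_judge, question, answer_a, answer_b):
    """Query the judge twice with the answer order swapped; accept a
    verdict only if it survives the swap, otherwise record a tie.
    `ask_judge` is a hypothetical stand-in for the judge-model call."""
    v1 = ask_judge(question, answer_a, answer_b)  # A shown first
    v2 = ask_judge(question, answer_b, answer_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"  # judge preferred A in both orders
    if v1 == "second" and v2 == "first":
        return "B"  # judge preferred B in both orders
    return "tie"  # inconsistent verdicts: position likely drove the choice
```

The cost is doubling the judge calls, but any preference that flips when the order flips is precisely the position bias being measured.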
Accuracy measures how often a model provides the correct answer, while calibration measures whether the model's reported confidence matches its actual probability of being correct. For example, if a model says it is 80% sure of an answer, a well-calibrated model should be right about 80% of the time. Many current models are overconfident due to "Instruction Tuning," meaning they sound certain even when they are guessing, which necessitates post-training recalibration techniques like "Temperature Scaling."
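A minimal sketch of temperature scaling, fitting T by grid search on held-out negative log-likelihood; real pipelines typically fit T by gradient descent, and the three-question logit set below is invented.

```python
import math

def softmax(logits, temperature):
    """Softmax of logits divided by T; T > 1 softens overconfident outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fit_temperature(logit_sets, labels):
    """Grid-search the T that minimizes negative log-likelihood on a
    held-out set, a crude stand-in for gradient-based fitting."""
    grid = [0.5 + 0.1 * i for i in range(41)]  # T from 0.5 to 4.5

    def nll(t):
        return -sum(math.log(softmax(ls, t)[y])
                    for ls, y in zip(logit_sets, labels))

    return min(grid, key=nll)

# Toy held-out set: the model is ~96% confident in class 0 every time,
# but it is right on only two of the three questions.
logit_sets = [[4.0, 0.0, 0.0]] * 3
labels = [0, 0, 1]
print(f"fitted T = {fit_temperature(logit_sets, labels):.1f}")  # ~2.9: soften
```

A fitted T well above 1 is the signature of overconfidence: dividing the logits by it pulls the model's stated certainty back toward its actual hit rate without changing which answer it picks.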
