AI leaderboards often ignore statistical noise. Learn how Anthropic’s new approach to error bars provides a more accurate way to rank model performance.

Statistics is the science of measurement in the presence of noise. AI evaluations are, by their nature, incredibly noisy; this isn't about making the noise go away—it’s about learning how to work with it honestly and precisely.
The question universe is the theoretical set of all possible questions that could test a specific skill, such as physics, law, or coding. Current AI benchmarks like MMLU or MATH sample only a small fraction of these questions. Anthropic’s research suggests that a model’s score should be viewed not as an absolute truth, but as an estimate of its performance across this entire unseen super-population. Without acknowledging this "universe," researchers may mistake a model’s luck on a specific set of questions for actual underlying mastery of a subject.
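Under this framing, a benchmark score is a sample mean, and it deserves a standard error. A minimal sketch (using made-up grades, and assuming questions are independent draws from the universe):

```python
import math

def eval_confidence_interval(scores, z=1.96):
    """95% CI for the mean score, treating the benchmark's questions
    as an i.i.d. sample from a larger question universe."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    sem = math.sqrt(var / n)  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem

# Hypothetical benchmark: 100 graded questions, 1 = correct, 0 = incorrect.
scores = [1] * 72 + [0] * 28
mean, lo, hi = eval_confidence_interval(scores)
# A "72%" score is really an estimate of roughly 63%-81% at 95% confidence.
```

The point is that a single headline number like 72% hides an interval several points wide, which is often larger than the gap between adjacent models on a leaderboard.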
Standard statistical math often assumes every question is an independent event, but many evaluations use "clustering," where multiple questions are tied to a single long passage. If a model misunderstands a specific passage, it will likely miss all related questions, meaning the questions are not independent draws. Ignoring this clustering can yield standard errors as little as one-third the size they should be, giving researchers a false sense of confidence in results that might actually be statistical noise.
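The standard fix is a cluster-robust standard error: sum the residuals within each passage before squaring, so correlated misses within a passage inflate the variance estimate. A minimal sketch with toy data:

```python
import math

def clustered_sem(scores_by_passage):
    """Cluster-robust standard error of mean accuracy, where each
    inner list holds the scores for questions tied to one passage."""
    flat = [s for cluster in scores_by_passage for s in cluster]
    n = len(flat)
    mean = sum(flat) / n
    # Sum residuals within each cluster BEFORE squaring, so correlated
    # misses inside a passage inflate the variance as they should.
    var = sum(sum(s - mean for s in c) ** 2 for c in scores_by_passage)
    return mean, math.sqrt(var) / n

# Toy data: 4 passages x 5 questions each; failures cluster by passage.
clusters = [[1] * 5, [1] * 5, [0] * 5, [1, 1, 1, 0, 0]]
mean, clustered_se = clustered_sem(clusters)

# Naive SE that wrongly treats all 20 questions as independent.
naive_se = math.sqrt(mean * (1 - mean) / 20)
```

On this toy data the clustered standard error is nearly double the naive one, because one badly misread passage wiped out five answers at once.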
Instead of forcing a model to pick a single answer (like "A" or "B"), researchers can look at the internal probability the model assigns to the correct token. For example, if a model assigns a 72% probability to the correct answer, it receives a score of 0.72. This method eliminates the randomness associated with token generation and "temperature" settings. It provides a more nuanced, continuous score that can reduce measurement variance by up to two-thirds compared to traditional pass/fail grading.
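The variance reduction is easy to see numerically. A sketch comparing the two scoring schemes on hypothetical per-question probabilities (not real benchmark data):

```python
import statistics

# Hypothetical probabilities a model assigns to the correct answer.
probs = [0.95, 0.9, 0.85, 0.8, 0.72, 0.6, 0.55, 0.4, 0.3, 0.1]

# Continuous scoring: the score IS the probability (0.72 -> 0.72).
continuous = probs

# Discrete pass/fail scoring: correct iff the right token is most likely
# (here approximated as p > 0.5).
discrete = [1.0 if p > 0.5 else 0.0 for p in probs]

var_cont = statistics.variance(continuous)
var_disc = statistics.variance(discrete)
# The continuous scores carry far less variance than the 0/1 grades.
```

On this toy data the continuous variance is roughly a third of the pass/fail variance, consistent with the up-to-two-thirds reduction described above.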
A paired-difference analysis compares two models by looking at how they performed on the exact same questions, rather than just comparing their final average scores. Since frontier models often struggle with or excel at the same specific questions, their results are highly correlated. By focusing on the "gap" per question, researchers can subtract out the noise caused by question difficulty. This makes the measurement of the difference between two models much more precise and can even reveal that a model with a lower average score is actually the statistically significant winner.
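Because the per-question gaps subtract out shared question difficulty, the paired standard error can be much smaller than the naive unpaired one. A sketch with hypothetical correlated scores:

```python
import math
import statistics

# Hypothetical per-question scores (1 = correct) for two models on the
# SAME questions; both fail the same hard questions, so scores correlate.
model_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
model_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

n = len(model_a)
diffs = [b - a for a, b in zip(model_a, model_b)]

# Paired analysis: standard error of the per-question gap.
paired_se = math.sqrt(statistics.variance(diffs) / n)

# Naive analysis: treat the two score lists as independent samples.
unpaired_se = math.sqrt(statistics.variance(model_a) / n
                        + statistics.variance(model_b) / n)
```

Here the paired standard error is less than half the unpaired one, so a gap that looks like noise under the naive comparison can be clearly resolved by pairing.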
Power analysis is a statistical procedure used to determine whether an evaluation is sensitive enough to detect a real difference between models before the test is even run. It helps researchers calculate the necessary sample size—often at least a thousand independent questions—to ensure a positive result isn't just a false alarm. This prevents researchers from "weighing a diamond on a bathroom scale" by ensuring the test has enough statistical power to see small performance gains, such as a 2% or 3% improvement.
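A textbook version of this calculation, for an unpaired comparison of two accuracies at 5% significance and 80% power (the baseline accuracy and gap below are illustrative assumptions, not figures from the research):

```python
import math

def required_sample_size(p, delta, z_alpha=1.96, z_beta=0.84):
    """Questions needed per model to detect an accuracy gap of `delta`
    around a baseline accuracy `p`, at 5% significance (two-sided,
    z_alpha) and 80% power (z_beta), for an unpaired two-model
    comparison of proportions."""
    variance = 2 * p * (1 - p)  # both models contribute binomial variance
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Detecting a 3-point gain around 70% accuracy takes thousands of questions.
n_needed = required_sample_size(p=0.7, delta=0.03)
```

Pairing the analysis, as in the previous section, shrinks this requirement substantially, which is exactly why paired designs matter for small frontier-model gaps.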
