Small score gaps in model evals might just be noise. Learn how to use error bars and basic statistical rigor to determine whether your model is actually better.

The biggest red flag in AI right now isn't a low score—it’s a high score with no error bars. We need to stop treating evals like static scores and start treating them like the scientific experiments they actually are.
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
Evan Miller, Anthropic (evanmiller@anthropic.com)

Abstract: Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze...


Ranking models by tiny margins—such as a 0.5% difference—is often misleading because these fluctuations may simply be statistical noise rather than a reflection of true capability. Evaluation datasets are finite samples pulled from a theoretical "super-population" of all possible questions. Without calculating error bars or standard error, it is impossible to know if a higher score is a significant result or if the ranks would flip if the experiment were run again with different questions or different model seeds.
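As a concrete illustration, here is a minimal sketch of attaching a 95% confidence interval to an eval score, assuming the questions are independent draws from that super-population. The two simulated models and their 0.5-point gap are hypothetical data, not results from the paper.

```python
import numpy as np

def mean_with_ci(scores, z=1.96):
    """Return the mean eval score and a 95% confidence interval.

    scores: per-question results (1.0 = correct, 0.0 = incorrect).
    Assumes questions are independent draws from the super-population.
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    mean = scores.mean()
    # Standard error of the mean: sample standard deviation over sqrt(n)
    sem = scores.std(ddof=1) / np.sqrt(n)
    return mean, (mean - z * sem, mean + z * sem)

# Hypothetical: two models 0.5 percentage points apart on a 1,000-question eval
rng = np.random.default_rng(0)
model_a = rng.binomial(1, 0.800, size=1000)
model_b = rng.binomial(1, 0.805, size=1000)
for name, s in [("A", model_a), ("B", model_b)]:
    m, (lo, hi) = mean_with_ci(s)
    print(f"Model {name}: {m:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")
```

With only a thousand questions, the intervals for the two models typically overlap heavily, which is exactly why a bare 0.5% gap tells you very little on its own.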
The Rule of Three is a statistical guideline for the case where a model passes every single test in a small sample. If you run 30 safety tests and the model never fails, it is mathematically incorrect to claim the model is 100% safe. Instead, the rule says that the 95% confidence upper bound for the failure rate is approximately 3 divided by the number of tests. In a 30-test scenario, you can only say with 95% confidence that the failure rate is below 10% in the wild.
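A quick sketch of that arithmetic, assuming n independent trials with zero observed failures:

```python
def rule_of_three_upper_bound(n_trials):
    """Approximate 95% upper confidence bound on the failure rate
    when zero failures are observed in n_trials independent tests."""
    return 3.0 / n_trials

# 30 safety tests, all passed: the failure rate is only bounded at ~10%
print(rule_of_three_upper_bound(30))  # 0.1
```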
Standard statistical assumptions require that every question in a dataset be independent, but real-world benchmarks often violate this by using multiple questions based on the same document or translating the same prompt into different languages. If a model struggles with the underlying context, it will likely fail all related questions, meaning they are not independent "votes" on performance. Clustered Standard Errors account for this correlation by grouping related items, preventing researchers from underestimating uncertainty and reporting artificially small error bars.
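The sketch below shows one common way to compute a cluster-robust standard error for an eval mean, grouping questions by a shared source document. It uses the usual cluster-robust variance estimator without small-sample corrections, and the documents and scores are hypothetical.

```python
import numpy as np

def clustered_se(scores, cluster_ids):
    """Cluster-robust standard error of the mean eval score.

    scores: per-question results (1.0 / 0.0).
    cluster_ids: identifier of the shared context (e.g. source document)
    for each question; correlated questions share an id.
    """
    scores = np.asarray(scores, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    n = len(scores)
    resid = scores - scores.mean()
    # Sum residuals within each cluster before squaring: correlated errors
    # inside a cluster no longer cancel the way independent errors would.
    var = sum(resid[cluster_ids == c].sum() ** 2
              for c in np.unique(cluster_ids)) / n**2
    return np.sqrt(var)

# Five questions per document: results within a document move together
rng = np.random.default_rng(1)
doc_effect = rng.normal(0, 0.3, size=200).repeat(5)   # shared per-document difficulty
scores = (rng.normal(0.8, 0.2, size=1000) + doc_effect > 0.5).astype(float)
clusters = np.arange(200).repeat(5)

naive_se = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"naive SE:     {naive_se:.4f}")
print(f"clustered SE: {clustered_se(scores, clusters):.4f}")  # typically larger
```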
One of the most effective ways to shrink error bars is to use continuous metrics like "logprobs" (log probabilities) instead of binary pass/fail scores. By looking at the probability the model assigned to the correct answer rather than whether it happened to sample that answer, you eliminate "within-question" variance caused by the model's internal randomness. Other strategies include resampling (averaging multiple completions for the same prompt) and averaging results across the final few checkpoints of a training run to smooth out lucky fluctuations in model weights.
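As a sketch of the resampling idea, assuming you can draw several completions per prompt: average each question's pass/fail results into a per-question score first, then compute the standard error across questions, so within-question sampling noise stops inflating the error bar. The 500-question, 8-completion dataset here is hypothetical.

```python
import numpy as np

def resampled_mean_and_se(per_question_samples):
    """per_question_samples: array of shape (n_questions, k_resamples),
    each entry 1.0 if that completion was correct, else 0.0."""
    # Average over resamples first: the per-question mean is a less noisy
    # estimate of that question's true pass probability.
    question_means = per_question_samples.mean(axis=1)
    n = len(question_means)
    se = question_means.std(ddof=1) / np.sqrt(n)
    return question_means.mean(), se

# Hypothetical eval: 500 questions, 8 completions each
rng = np.random.default_rng(2)
true_p = rng.uniform(0.3, 0.95, size=500)      # each question's true pass rate
samples = rng.binomial(1, true_p[:, None], size=(500, 8)).astype(float)

single_shot_se = samples[:, 0].std(ddof=1) / np.sqrt(500)  # one completion each
mean, se = resampled_mean_and_se(samples)
print(f"single-completion SE: {single_shot_se:.4f}")
print(f"8-resample SE:        {se:.4f}")                    # smaller error bar
```

Using logprobs directly goes one step further: the per-question score becomes the model's probability of the correct answer, so there is no sampling noise to average away at all.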
Comparing two separate error bars is often too conservative; models can have overlapping confidence intervals and still show a statistically significant difference. A paired difference test evaluates both models on the exact same set of questions and focuses on the gap between their scores. Because models usually agree on which questions are difficult, their scores are positively correlated. Subtracting these correlated variables shrinks the variance of the difference, making the test much more sensitive and capable of detecting real improvements that a naive comparison would miss.
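Here is a minimal sketch of a paired difference test on per-question scores, using NumPy and SciPy; the two models and the correlation structure are hypothetical.

```python
import numpy as np
from scipy import stats

def paired_comparison(scores_a, scores_b):
    """Paired test: both models answered the same questions, so we analyze
    the per-question difference rather than two separate means."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    mean_diff = diff.mean()
    se_diff = diff.std(ddof=1) / np.sqrt(len(diff))  # shrinks when scores correlate
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    return mean_diff, se_diff, p_value

# Hypothetical: both models track question difficulty, so scores are correlated
rng = np.random.default_rng(3)
difficulty = rng.uniform(0.2, 0.9, size=2000)
scores_a = rng.binomial(1, difficulty, size=2000)
scores_b = rng.binomial(1, np.clip(difficulty + 0.02, 0, 1), size=2000)  # slightly better

mean_diff, se_diff, p = paired_comparison(scores_b, scores_a)
print(f"mean improvement: {mean_diff:+.4f} ± {1.96 * se_diff:.4f} (p = {p:.3f})")
```

The standard error of the paired difference is smaller than what two separate error bars would suggest, which is why a real 2-point improvement can be significant even when the individual confidence intervals overlap.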
