Lena and Miles unpack the statistical error that large-model evaluation routinely ignores, arguing that narrow leaderboard gaps may be nothing but random noise. By bringing in tools like confidence intervals and paired experiments, they show how to cut through the ranking chaos and see a model's real capability.

Evaluating a model is really a statistical experiment, but right now everyone runs it far too sloppily. What we actually care about is not just a model's score on a fixed question bank, but its expected performance across all possible tasks.
https://arxiv.org/pdf/2411.00640


A large-model eval is fundamentally a statistical sampling experiment. The questions on a leaderboard are just a sample drawn from an effectively infinite "super-population" of tasks, so scores come with random noise. If the gap between two models is smaller than the statistical error bar, i.e. the confidence interval, the apparent lead may be nothing more than fluctuation from which questions happened to be sampled, not a reflection of true capability.
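A minimal sketch of the idea: treat each question's 0/1 score as an i.i.d. draw from the super-population and report a CLT-based confidence interval alongside the mean. The function name, the 1000-question eval, and the 71% score are illustrative assumptions, not from the paper.

```python
import math

def eval_ci(scores, z=1.96):
    """Mean eval score plus a 95% CI half-width under the CLT,
    assuming questions are i.i.d. draws from a super-population."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)  # standard error scaled to 95%
    return mean, half

# Hypothetical 1000-question eval where a model gets 71% right:
mean, half = eval_ci([1] * 710 + [0] * 290)  # mean 0.71, half-width ~0.028
```

With an error bar of roughly ±2.8 points, a rival model scoring 69.5% on the same benchmark sits well inside the interval: the 1.5-point "lead" is indistinguishable from sampling noise.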
Clustering effects arise when multiple questions in an eval share the same source material, for example ten questions attached to one reading-comprehension passage. Treating them as fully independent samples makes the computed error bars far too narrow, sometimes by a factor of three. The result is a false sense of measurement precision, and researchers end up mistaking random noise for a significant improvement.
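The fix is to compute cluster-robust standard errors: sum residuals within each passage first, then combine across passages. The sketch below, with a made-up dataset of 10 passages × 10 questions where answers within a passage are perfectly correlated, illustrates how the naive SE can understate the truth by about 3×.

```python
import math
from collections import defaultdict

def naive_se(scores):
    """SE of the mean assuming every question is independent."""
    n = len(scores)
    m = sum(scores) / n
    var = sum((s - m) ** 2 for s in scores) / (n - 1)
    return math.sqrt(var / n)

def clustered_se(scores, cluster_ids):
    """Cluster-robust SE of the mean: residuals are summed within each
    cluster first, so correlated questions can't masquerade as independent."""
    n = len(scores)
    m = sum(scores) / n
    totals = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        totals[c] += s - m
    return math.sqrt(sum(t ** 2 for t in totals.values())) / n

# Worst case for illustration: 10 passages of 10 questions, each passage
# answered all-right or all-wrong (6 passages right, 4 wrong).
scores = [1] * 60 + [0] * 40
clusters = [i // 10 for i in range(100)]
# naive_se(scores) ~0.049 vs clustered_se(scores, clusters) ~0.155: ~3x wider.
```

Real evals sit between full independence and this worst case, but whenever questions share a stimulus, the clustered estimate is the honest one.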
Setting the temperature to 0 does remove output randomness, but it also changes the model's behavior, making it rigid or even prone to repetition, so it no longer reflects how the model performs in real applications. Moreover, forcibly "rounding" a probability distribution into a deterministic output can introduce bias: the measured score is stable, yet wrong or misleading.
One effective approach is to score directly from next-token probabilities, which is equivalent to resampling the model infinitely many times and sharply reduces variance. When answers must actually be generated, a resampling strategy works: have the model answer each question several times (say 4 to 6) and average the scores, which removes most of the noise introduced by random sampling.
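The resampling strategy can be sketched as follows. `toy_grade` is a hypothetical stand-in for one stochastic model call graded 0/1 (here each "prompt" is just the question's assumed true pass rate); averaging k answers per question shrinks the sampling-noise component of variance by roughly a factor of k.

```python
import random

def score_with_resampling(grade, prompts, k=5, seed=0):
    """Ask each question k times, average the k grades, then average
    over questions. Sampling noise per question shrinks ~1/k."""
    rng = random.Random(seed)
    per_question = [sum(grade(p, rng) for _ in range(k)) / k for p in prompts]
    return sum(per_question) / len(per_question)

def toy_grade(p, rng):
    # Hypothetical stochastic grader: the model passes question `p`
    # (its true pass rate) with probability p on each independent try.
    return 1.0 if rng.random() < p else 0.0

# 100 questions, each with a 70% true pass rate; compare run-to-run spread:
one_shot = score_with_resampling(toy_grade, [0.7] * 100, k=1)
averaged = score_with_resampling(toy_grade, [0.7] * 100, k=5)
```

Re-running across many seeds, the k=5 scores cluster noticeably tighter around 0.7 than the k=1 scores, which is exactly the variance reduction the episode describes.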
Paired analysis computes the per-question score difference between two models, canceling out the interference from question difficulty. Because both models face the same difficulty fluctuations on the same question set, analyzing differences rather than absolute totals exploits the correlation between the two score series to shrink the error bar substantially. A gap that looked like a blur can become statistically clear and significant.
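A sketch of why pairing helps, on made-up data: two models agree on 900 of 1000 questions, model A wins 70 of the disagreements and loses 30. The unpaired SE treats the two score columns as independent; the paired SE is computed from the per-question differences, where all the shared difficulty cancels.

```python
import math

def paired_vs_unpaired(scores_a, scores_b):
    """SE of the score gap two ways: unpaired (independent columns)
    vs paired (per-question differences, shared difficulty cancels)."""
    n = len(scores_a)

    def mean_var(xs):
        m = sum(xs) / n
        return m, sum((x - m) ** 2 for x in xs) / (n - 1)

    _, va = mean_var(scores_a)
    _, vb = mean_var(scores_b)
    se_unpaired = math.sqrt(va / n + vb / n)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff, vd = mean_var(diffs)
    se_paired = math.sqrt(vd / n)
    return mean_diff, se_unpaired, se_paired

# 600 both-right, 300 both-wrong, 70 A-only-right, 30 B-only-right:
a = [1] * 600 + [0] * 300 + [1] * 70 + [0] * 30
b = [1] * 600 + [0] * 300 + [0] * 70 + [1] * 30
gap, se_u, se_p = paired_vs_unpaired(a, b)
# gap 0.04; unpaired SE ~0.021 (borderline) vs paired SE ~0.010 (clear).
```

The same 4-point gap is under 2 unpaired standard errors but about 4 paired ones: the comparison goes from ambiguous to decisive without collecting a single extra question.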
