What the research actually says about confident AI — a calibration reading list
Six findings, from a 1950 weather-forecasting score to modern neural networks, that explain why fluent and right are different things — and why the fix is older than the problem.
“Calibration” can sound like a new word for an AI-era problem. It isn’t. The question of whether stated confidence matches real accuracy has a long, well-documented research history, and much of what matters about today’s confident-but-wrong models was anticipated by work done long before modern chatbots existed. Here is a short reading list of six findings, in chronological order, that together explain the problem and most of the fix. No new claims; just the literature, plainly summarised.
1950 — the score that started it (Brier)
Glenn Brier, working on weather forecasts, proposed scoring a probabilistic prediction by the squared distance between the stated probability and what actually happened (Brier, 1950). The deep property — the reason it is still the default seventy-five years later — is that it is a proper scoring rule: you minimise it, on average, only by reporting your true belief. You cannot improve your number by sounding more or less sure than you are. Every honest measurement of confidence traces back to this idea.
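As a rough illustration of that property, here is a minimal Python sketch. It uses the common binary form of the score (the mean squared difference between a stated probability and a 0/1 outcome); the function names and the 70% example are assumptions made for illustration, not anything taken from Brier's paper.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def expected_brier(reported, true_prob):
    """Expected score for one event when you say `reported` but the
    event really happens with probability `true_prob`."""
    return true_prob * (reported - 1) ** 2 + (1 - true_prob) * reported ** 2


# Assumed example: you genuinely believe the chance of rain is 70%.
true_prob = 0.7

# Try every report from 0% to 100% and keep the one with the lowest expected score.
candidates = [i / 100 for i in range(101)]
best_report = min(candidates, key=lambda r: expected_brier(r, true_prob))

print(best_report)  # the report that minimises the expected score
```

Under these assumptions the search lands on the true belief itself: shading the report up or down from 0.7 only raises the expected score, which is what "proper" means here.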
1973 — confidence has parts (Murphy)
Allan Murphy showed that a Brier score decomposes cleanly into three terms: reliability (do your 70%s actually come true 70% of the time?), resolution (do you say different things about different cases, or hedge everything to the base rate?), and the irreducible uncertainty of the events themselves (Murphy, 1973). The score equals reliability minus resolution plus uncertainty, where reliability is a penalty (lower is better) and resolution is a credit (higher is better). This is why “is it calibrated?” and “is it useful?” are two different questions. A model can be perfectly reliable and still worthless if it has no resolution: a forecaster who announces the long-run base rate every day is never miscalibrated and never informative.
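The decomposition can be sketched in a few lines of Python. This is a hedged illustration for binary outcomes, grouping forecasts by their exact stated value; the function names and the ten made-up outcomes are assumptions, and real evaluations usually bin nearby probabilities together instead.

```python
from collections import defaultdict


def brier(forecasts, outcomes):
    """Mean squared difference between forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by exact value, so
    brier = reliability - resolution + uncertainty
    holds up to floating-point rounding.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Collect the observed outcomes for each distinct stated probability.
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0  # penalty: gap between stated probability and hit rate
    resolution = 0.0   # credit: how far each group's hit rate sits from the base rate
    for f, obs in groups.items():
        hit_rate = sum(obs) / len(obs)
        reliability += len(obs) * (f - hit_rate) ** 2
        resolution += len(obs) * (hit_rate - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical data: ten events, half of which happened.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
sharp = [0.8] * 5 + [0.2] * 5   # distinguishes likely cases from unlikely ones
hedger = [0.5] * 10             # always states the base rate

sharp_terms = murphy_decomposition(sharp, outcomes)
hedger_terms = murphy_decomposition(hedger, outcomes)
print(sharp_terms, brier(sharp, outcomes))
print(hedger_terms, brier(hedger, outcomes))
```

In this toy data both forecasters are perfectly reliable, but only the sharp one has any resolution, so its Brier score is lower (about 0.16 against 0.25).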
1982 — people are overconfident, and worse on hard questions (Lichtenstein, Fischhoff & Phillips)
The classic survey of human probability judgment found a robust pattern: people assign more certainty to their answers than their accuracy warrants, and the gap widens on harder questions — the “hard–easy effect” (Lichtenstein, Fischhoff & Phillips, 1982). This matters for AI because the same failure shows up exactly where it does the most damage: the difficult, high-stakes judgments you most wanted help with are the ones where confidence is least trustworthy.
2005 — experts barely beat the baseline (Tetlock)
Philip Tetlock’s long-running study of expert political forecasting reached a famously uncomfortable conclusion: aggregate expert predictions struggled to beat simple statistical baselines, and the most confident, media-friendly experts were often the least accurate (Tetlock, 2005). His distinction between “hedgehogs” (one big idea, applied everywhere, loudly) and “foxes” (many small models, held provisionally) is in large part a distinction about calibration: the foxes were better calibrated, plausibly because they held their views loosely and revised them when reality disagreed.
2008 — overconfidence isn’t one thing (Moore & Healy)
Don Moore and Paul Healy untangled “overconfidence” into three distinct phenomena: overestimation (thinking you did better than you did), overplacement (thinking you’re better than others), and overprecision (excessive certainty in your beliefs) — and showed they don’t always move together (Moore & Healy, 2008). For anyone evaluating an AI, overprecision is the dangerous one: a system can be appropriately modest about ranking itself and still be far too sure of each individual answer.
2017 — modern neural networks made it worse (Guo et al.)
Here the human story becomes the machine story. Guo, Pleiss, Sun and Weinberger found that modern deep neural networks, for all their accuracy gains, are more poorly calibrated than the smaller, older networks that preceded them, systematically overstating their confidence (Guo et al., 2017). Their proposed fix, temperature scaling, is almost embarrassingly simple: divide the network’s logits by a single scalar, the temperature, fitted on held-out data. The top prediction stays the same; only the stated confidence softens. The lesson generalises: capability and calibration are different axes, and bigger does not buy you honesty for free.
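A minimal Python sketch of temperature scaling follows, under stated assumptions: the logits and labels are invented for illustration, the function names are hypothetical, and the temperature is chosen by a plain grid search over held-out negative log-likelihood, standing in for the gradient-based optimiser typically used in practice.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens confidence)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels):
    """Pick the temperature on a grid from 0.5 to 5.0 that minimises held-out NLL."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Hypothetical held-out logits from a 3-class model that is always very sure
# of itself; the last example is a confident mistake (true class is 1).
val_logits = [[6.0, 1.0, 0.0], [5.0, 0.5, 0.0], [0.0, 7.0, 1.0], [6.0, 0.0, 1.0]]
val_labels = [0, 0, 1, 1]

temperature = fit_temperature(val_logits, val_labels)
before = softmax(val_logits[0])
after = softmax(val_logits[0], temperature)
print(temperature, max(before), max(after))
```

Because every logit in a row is divided by the same positive number, the ranking of classes never changes; in this toy example the fitted temperature comes out above 1, so the model's top-class probability drops while its answers stay put.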
The through-line
Read together, the arc is clear. We have a precise, un-gameable way to measure confidence (Brier), a way to see what it’s made of (Murphy), decades of evidence that both people and models are overconfident by default (Lichtenstein, Tetlock, Moore & Healy, Guo), and cheap, known methods to correct it. The problem is old and the tools exist; what’s usually missing is the discipline of actually scoring claims against outcomes and publishing the result. That gap is the whole reason AEQUARA exists.
The fastest way to internalise all six findings is to catch yourself being overconfident: the free Calibration Scorecard measures your own overconfidence in two minutes, and the AI Trust Index applies the same 1950 instrument to today’s frontier models, in public. The research is settled; the only question is whether anyone keeps score.