Reading the AI Trust Index: what 13,171 graded forecasts say about today’s frontier models
We scored the leading models the way you’d score a forecaster — against what actually happened. Here is how to read the result, and why “smart” and “calibrated” turn out to be different axes.
Most AI leaderboards measure capability: can the model solve the problem? That’s worth knowing, but it leaves out the question that decides whether you can rely on a model in the open world: when it’s confident, is it right? The AI Trust Index measures that second thing.
The method, in one paragraph
We take frontier models’ probabilistic answers and score them against ground truth with a Brier score — the same proper scoring rule a serious forecaster is held to. The current run covers 18 frontier models across 13,171 scored pairs: 7,447 forecasting questions plus 5,724 knowledge multiple-choice questions. Three rules keep it honest: it is anti-issuer-pay (no model pays to be rated), it is error-published (the misses are on the record next to the hits), and the methodology is hash-stable, so the rules can’t quietly shift between runs.
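For readers who want the scoring rule in concrete form, here is a minimal sketch of a binary Brier score in Python. It assumes each answer is a single probability for a yes/no outcome; how the index handles multi-option questions is not described here, and the function name and sample data are illustrative, not taken from the index's own code.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated probabilities and 0/1 outcomes.

    0.0 is a perfect score; lower is better.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical data: four stated probabilities and what actually happened.
stated = [0.9, 0.8, 0.3, 0.6]
happened = [1, 1, 0, 1]
score = brier_score(stated, happened)
print(f"Brier score: {score:.4f}")
```

Because the penalty is squared, one confident miss costs far more than several cautious ones, which is what makes the rule reward honest probabilities.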
The headline number
Out of sample, the calibrated engine scores a Brier of 0.1877 against an uninformed baseline of 0.25. Lower is better, and 0.25 is what you get by answering 50% on every yes/no question. The gap between those two numbers is the entire point: it is the measurable distance between “sounds confident” and “is actually informative.” A model that merely sounded sure would land at the baseline or worse; one whose confidence carries real information moves below it.
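The arithmetic behind that comparison can be sketched in a few lines. The snippet below assumes binary questions and uses made-up outcomes; the skill-score formula (one minus the ratio of model score to baseline score) is a standard way to express the gap, not a figure the index itself reports.

```python
def brier_score(probs, outcomes):
    """Mean squared gap between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(model_brier, baseline_brier):
    """Fraction of the baseline's error that the model removes (higher is better)."""
    if baseline_brier <= 0:
        raise ValueError("baseline Brier score must be positive")
    return 1.0 - model_brier / baseline_brier


# Hypothetical outcomes: half the questions resolve yes, half resolve no.
outcomes = [1, 0] * 50

# Always answering 50% scores 0.25 whatever happens: (0.5 - o)^2 = 0.25.
coin_flip = brier_score([0.5] * len(outcomes), outcomes)

# Always answering 90% with no real information scores about 0.41, worse than 0.25.
loud = brier_score([0.9] * len(outcomes), outcomes)

# The headline numbers from the text: about 0.2492, i.e. roughly a quarter
# of the baseline's error removed.
skill = brier_skill_score(0.1877, 0.25)

print(coin_flip, round(loud, 4), round(skill, 4))
```

The middle case shows that confidence without information does not earn the baseline score; it is penalized beyond it.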
Smart and calibrated are different axes
The most useful thing the index surfaces is that raw capability and calibration don’t move together as neatly as you’d expect. A model can be brilliant at hard reasoning and still systematically overstate its certainty; another can be more modest in raw power but unusually honest about what it doesn’t know. If you only ever look at capability leaderboards, you will keep being surprised by confident, fluent, wrong answers — because you measured the wrong axis.
How to read your model’s row
Don’t just rank by the headline. Look at the shape of the miss: is the model overconfident (says 90%, hits 70%) or underconfident (the rarer, safer failure)? Overconfidence is the one that hurts in production, because it’s the one that talks you into acting. A model you can deploy is one whose stated confidence you can take roughly at face value — and that is exactly what a calibration score, rather than an accuracy score, tells you.
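One way to see the shape of the miss is a simple reliability table: group answers by stated confidence and compare each group's average confidence with its hit rate. The sketch below assumes each probability is the model's confidence in its chosen answer and each outcome records whether that answer was right; the function name, bin count, and sample data are all illustrative.

```python
from collections import defaultdict


def calibration_table(probs, outcomes, n_bins=10):
    """Return (mean_confidence, hit_rate, gap, count) for each non-empty bin.

    gap = mean_confidence - hit_rate, so a positive gap means overconfident
    and a negative gap means underconfident.
    """
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        # Clamp so that p == 1.0 falls in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_conf = sum(p for p, _ in pairs) / len(pairs)
        hit_rate = sum(o for _, o in pairs) / len(pairs)
        rows.append((mean_conf, hit_rate, mean_conf - hit_rate, len(pairs)))
    return rows


# Hypothetical model: says 90% ten times, is right seven times.
rows = calibration_table([0.9] * 10, [1] * 7 + [0] * 3)
for mean_conf, hit_rate, gap, count in rows:
    label = "overconfident" if gap > 0 else "underconfident or calibrated"
    print(f"conf={mean_conf:.2f} hit={hit_rate:.2f} gap={gap:+.2f} n={count} {label}")
```

A row with a large positive gap at high confidence is the production risk described above: the model sounds most certain exactly where it is least entitled to.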
Why we publish it in the open
A score you can’t check is just another assertion. So the index is public, the underlying record is hash-locked and recomputable in your browser, and we put our own errors on it. (The signature today is our own HMAC; an independent third-party anchor is in progress — the independence you get now comes from being able to re-derive the numbers yourself.) The same standard runs through everything we ship, from the consumer tools to the institutional attestation.
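To make "hash-locked and recomputable" concrete, here is a rough sketch of the idea using Python's standard library. Everything specific in it is an assumption for illustration: the record fields, the JSON canonicalization, the choice of SHA-256, and the signing key are hypothetical, and the index's actual format may differ.

```python
import hashlib
import hmac
import json


def canonical_bytes(records):
    """Serialize records deterministically so the same data always hashes the same."""
    return json.dumps(records, sort_keys=True, separators=(",", ":")).encode("utf-8")


def record_digest(records):
    """SHA-256 digest of the canonical record; anyone can recompute this."""
    return hashlib.sha256(canonical_bytes(records)).hexdigest()


def sign_digest(digest, key):
    """HMAC over the digest; only the holder of the secret key can produce it."""
    return hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()


def recompute_brier(records):
    """Re-derive the headline number directly from the raw record."""
    return sum((r["p"] - r["outcome"]) ** 2 for r in records) / len(records)


# Hypothetical record: one row per scored question.
records = [
    {"id": "q1", "p": 0.9, "outcome": 1},
    {"id": "q2", "p": 0.2, "outcome": 0},
]
published_digest = record_digest(records)
signature = sign_digest(published_digest, key=b"hypothetical-secret")

# A reader's check: the data matches the published digest, and the score re-derives.
data_matches = hmac.compare_digest(record_digest(records), published_digest)
rederived = recompute_brier(records)

# Changing a single probability changes the digest, so quiet edits are detectable.
tampered = [dict(r) for r in records]
tampered[0]["p"] = 0.99
tamper_detected = record_digest(tampered) != published_digest
print(data_matches, round(rederived, 4), tamper_detected)
```

The sketch also shows why the HMAC alone is not independent proof: verifying it requires the publisher's secret key, while the digest and the score can be recomputed by anyone holding the record.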
The fastest way to understand the idea is to feel it from the inside: the free Calibration Scorecard runs the same kind of measurement on your own judgment in two minutes. Then go read the index and find your model’s row.