
Reading the AI Trust Index: what 13,171 graded forecasts say about today’s frontier models

We scored the leading models the way you’d score a forecaster — against what actually happened. Here is how to read the result, and why “smart” and “calibrated” turn out to be different axes.

By AEQUARA · June 26, 2026

Most AI leaderboards measure capability: can the model solve the problem? That’s worth knowing, but it leaves out the question that decides whether you can rely on a model in the open world — when it’s confident, is it right? The AI Trust Index measures that second thing. Here is how it works and how to read it.

The method, in one paragraph

We take frontier models’ probabilistic answers and score them against ground truth with a Brier score — the same proper scoring rule a serious forecaster is held to. The current run covers 18 frontier models across 13,171 scored pairs: 7,447 forecasting questions plus 5,724 knowledge multiple-choice questions. Three rules keep it honest: it is anti-issuer-pay (no model pays to be rated), it is error-published (the misses are on the record next to the hits), and the methodology is hash-stable, so the rules can’t quietly shift between runs.
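As a rough illustration of the scoring rule named above, here is a minimal Python sketch of a Brier score for yes/no questions. It assumes each answer is a single probability paired with a 0/1 outcome; how the index handles multiple-choice questions is not described here, so that case is left out. The function name and the sample numbers are hypothetical, not taken from the index.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated probabilities and 0/1 outcomes.

    Lower is better; 0.0 is a perfect record.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Illustrative data only (not from the index).
outcomes = [1, 0, 1]
informative = brier_score([0.9, 0.1, 0.8], outcomes)    # confident and right
uninformed = brier_score([0.5, 0.5, 0.5], outcomes)     # always says 50%
confidently_wrong = brier_score([0.1, 0.9, 0.2], outcomes)

print(informative, uninformed, confidently_wrong)
```

The three sample forecasters show why the rule is hard to game: stating 50% every time is safe but scores 0.25, and being confidently wrong scores far worse than that.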

The headline number

Out of sample, the calibrated engine scores a Brier of 0.1877 against an uninformed baseline of 0.25. Lower is better: 0.25 is what you get by answering 50% on every yes/no question, and 0 is a perfect record. The gap between those two numbers is the entire point: it is the measurable distance between “sounds confident” and “is actually informative.” A model that merely sounded sure would land near the baseline, or above it if its confidence were often misplaced; one whose confidence carries real information moves below it.
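One hedged way to put that gap on a common scale is a Brier skill score, the fraction of the baseline's error that the forecaster removes. This is a standard forecasting metric, not a figure the index itself is stated to report, and the function name below is an assumption for illustration.

```python
def brier_skill_score(model_brier: float, baseline_brier: float) -> float:
    """Fraction of the baseline's Brier error removed by the model.

    0.0 means no better than the baseline, 1.0 means a perfect record,
    and negative values mean worse than the baseline.
    """
    if baseline_brier <= 0:
        raise ValueError("baseline Brier score must be positive")
    return 1.0 - model_brier / baseline_brier


# Figures quoted in the article.
HEADLINE_BRIER = 0.1877
UNINFORMED_BASELINE = 0.25

skill = brier_skill_score(HEADLINE_BRIER, UNINFORMED_BASELINE)
print(round(skill, 4))
```

On the quoted numbers this works out to roughly a quarter of the baseline's error removed.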

Smart and calibrated are different axes

The most useful thing the index surfaces is that raw capability and calibration don’t move together as neatly as you’d expect. A model can be brilliant at hard reasoning and still systematically overstate its certainty; another can be more modest in raw power but unusually honest about what it doesn’t know. If you only ever look at capability leaderboards, you will keep being surprised by confident, fluent, wrong answers — because you measured the wrong axis.

How to read your model’s row

Don’t just rank by the headline. Look at the shape of the miss: is the model overconfident (says 90%, hits 70%) or underconfident (says 70%, hits 90%; the rarer, safer failure)? Overconfidence is the one that hurts in production, because it’s the one that talks you into acting. A model you can deploy is one whose stated confidence you can take roughly at face value, and that is exactly what a calibration score, rather than an accuracy score, tells you.
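The "shape of the miss" can be sketched as a small calibration table: group answers by stated confidence, then compare the average stated confidence in each group with how often the model was actually right. The sketch below assumes each probability is the model's confidence in its own chosen answer and each outcome is 1 if that answer was correct. The bin count, the 0.05 tolerance, and both function names are illustrative choices, not the index's method.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def calibration_table(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean stated confidence, hit rate, count) per non-empty bin."""
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        # Clamp so that p == 1.0 falls into the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_stated = sum(p for p, _ in pairs) / len(pairs)
        hit_rate = sum(o for _, o in pairs) / len(pairs)
        rows.append((mean_stated, hit_rate, len(pairs)))
    return rows


def miss_shape(mean_stated: float, hit_rate: float, tol: float = 0.05) -> str:
    """Label a bin by comparing stated confidence with the observed hit rate."""
    gap = mean_stated - hit_rate
    if gap > tol:
        return "overconfident"
    if gap < -tol:
        return "underconfident"
    return "roughly calibrated"


# The article's example: says 90%, hits 70% (illustrative data).
rows = calibration_table([0.9] * 10, [1] * 7 + [0] * 3)
for mean_stated, hit_rate, count in rows:
    print(f"{mean_stated:.2f} stated, {hit_rate:.2f} observed, n={count}:",
          miss_shape(mean_stated, hit_rate))
```

With few answers per bin the hit rates are noisy, so a real reading would want enough data in each row before trusting the label.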

Why we publish it in the open

A score you can’t check is just another assertion. So the index is public, the underlying record is hash-locked and recomputable in your browser, and we put our own errors on it. (The signature today is our own HMAC; an independent third-party anchor is in progress — the independence you get now comes from being able to re-derive the numbers yourself.) The same standard runs through everything we ship, from the consumer tools to the institutional attestation.
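To make "hash-locked and recomputable" concrete, here is a minimal sketch of the general pattern using Python's standard `hashlib`, `hmac`, and `json` modules. The record layout, field names, and key are all hypothetical; the index's actual record format and signing scheme are not described in this article, so treat this as an illustration of the idea rather than its implementation.

```python
import hashlib
import hmac
import json


def record_digest(record: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order cannot change the hash."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_published(record: dict, published_digest: str) -> bool:
    """Anyone holding the record can re-derive the digest; no key is needed."""
    return hmac.compare_digest(record_digest(record), published_digest)


def sign_digest(digest: str, key: bytes) -> str:
    """Issuer-side HMAC over the digest; checking it requires the secret key."""
    return hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical record and key, for illustration only.
record = {"question_id": "q-001", "stated_probability": 0.8, "outcome": 1}
published = record_digest(record)
signature = sign_digest(published, key=b"hypothetical-issuer-key")

tampered = dict(record, stated_probability=0.99)
print(matches_published(record, published))    # True
print(matches_published(tampered, published))  # False
```

The split mirrors the point in the paragraph above: recomputing the hash needs nothing but the data, while an HMAC can only be checked by someone holding the issuer's key, which is why re-derivation is the part a reader can verify independently.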

The fastest way to understand the idea is to feel it from the inside: the free Calibration Scorecard runs the same kind of measurement on your own judgment in two minutes. Then go read the index and find your model’s row.

