
What Fable 5’s #1 ranking actually shows — and the asterisk we won’t bury

The AI Trust Index just gave its top spot to Claude Fable 5. Read past the headline number and you find a statistical tie, an access caveat, and a cost tradeoff — exactly the nuance a calibrated leaderboard is supposed to surface.

By AEQUARA · July 1, 2026

Claude Fable 5 sits at #1 on the AI Trust Index — the highest calibration score of the 18 frontier models we score against 13,171 real graded pairs. A #1 finish is exactly the kind of headline that invites the overconfidence AEQUARA exists to push back on: report the number, skip the nuance, let the reader assume more than the data supports. So here is the row read properly — where the result actually comes from, and the parts of it we are not going to leave out.

The number, and what it’s tied with

Fable 5’s Trust score is 94 (bootstrap 95% CI 91.8–94.9), computed as 100 × (1 − mean expected calibration error) across knowledge and forecasting questions. That is the best score in the field, but DeepSeek-V3 also scores 94 (CI 91–94.9), with a combined calibration error of 0.0607 against Fable 5’s 0.0606. Unrounded, that is 93.93 against 93.94: a gap of 0.01 points, far smaller than the width of either interval, and the two intervals overlap almost completely. The honest description of the top of this leaderboard is a statistical tie, not a landslide, and an index that rounded away the tie to manufacture a clean #1 would be doing exactly the kind of overclaiming we built this to avoid.
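To make the arithmetic concrete, here is a minimal sketch of the score formula quoted above. The function names are hypothetical, and the equal-width ten-bin scheme is an assumption for illustration; the index's actual binning and its rule for combining the two domains are not specified here.

```python
from statistics import mean


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width-bin ECE: the size-weighted mean of |accuracy - confidence| per bin.

    The binning scheme is an assumption, not the index's documented method.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, outcome))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = mean(c for c, _ in members)
        avg_acc = mean(o for _, o in members)
        ece += len(members) / total * abs(avg_acc - avg_conf)
    return ece


def trust_score(combined_calibration_error):
    """Trust score as stated in the text: 100 x (1 - mean calibration error)."""
    return 100 * (1 - combined_calibration_error)


# Combined calibration errors quoted in this article.
fable_5 = trust_score(0.0606)      # about 93.94
deepseek_v3 = trust_score(0.0607)  # about 93.93
```

Both values round to 94, which is why the rounded leaderboard number alone cannot separate the two models.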

Where the edge actually shows up

Split by domain, the picture gets more specific. On 290 knowledge questions, Fable 5 answers 94.1% correctly with a Brier score of 0.0494 and an AUROC of 0.848 — strong, straightforward performance on questions with a known right answer. On 413 forecasting questions — predicting real-world events that hadn’t resolved yet — accuracy drops to 34.4%, which sounds alarming until you remember forecasting is hard for everyone, including expert humans. What matters for a calibration-first index is that the AUROC on forecasting is 0.948: even when the outright hit-rate is modest, Fable 5’s stated confidence still separates its correct calls from its incorrect ones with real discrimination. That is precisely the property a raw-accuracy leaderboard can’t see, and precisely the one calibration measures.
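A small sketch can show how a modest hit rate and a high AUROC coexist. The six answers below are invented toy data, not results from the index, and `auroc` is a hypothetical helper that uses the standard pairwise definition: the probability that a randomly chosen correct answer carries higher stated confidence than a randomly chosen incorrect one.

```python
def auroc(confidences, correct):
    """AUROC by pairwise comparison; ties count as half a win."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): two of six forecasts are right,
# and the right ones mostly carry the higher stated confidence.
correct = [1, 0, 0, 1, 0, 0]
confidences = [0.8, 0.3, 0.2, 0.7, 0.4, 0.75]

accuracy = sum(correct) / len(correct)
discrimination = auroc(confidences, correct)
```

In this toy case accuracy is one in three while seven of the eight correct-versus-incorrect pairs are ordered the right way, which is the pattern the forecasting row describes on a much larger sample.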

The asterisk we won’t bury

Here is the part a “trust me” leaderboard would quietly drop: Fable 5 has been access-gated since 2026-06-23. The API used to elicit these results now returns a 404 pointing callers to a different model. The #1 ranking was measured on 2026-06-10, on real graded pairs, and it stands — but it is not independently re-elicitable by a third party trying to reproduce it today. We publish that caveat next to the score rather than after a reader asks, because a calibration standard that hides the one fact most likely to change your confidence in the result isn’t a calibration standard at all.

Highest score, not the Pareto-efficient choice

There is a second nuance the headline number hides: cost. Fable 5 runs at $0.236 per call, in our premium cost tier, at roughly 17 seconds of latency, and it is not on the Trust-vs-cost Pareto frontier. DeepSeek-V3, the model it’s statistically tied with, costs $0.0027 per call (about 87 times less, close to two orders of magnitude) and is Pareto-efficient at that score. If a task simply needs the highest measured calibration at any price, Fable 5’s result is real. If it needs the best calibration per dollar, the data points somewhere else entirely. A single leaderboard number can’t tell you which question you’re actually asking. You have to look at both axes, which is why the model index publishes both.
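The frontier check can be sketched in a few lines. This is an illustration under stated assumptions: `ModelRow`, `dominates`, and `pareto_frontier` are hypothetical names, only the two models discussed here are included rather than all 18, and dominance is evaluated on the published rounded Trust score of 94 for both.

```python
from typing import NamedTuple


class ModelRow(NamedTuple):
    name: str
    trust: float          # published Trust score (rounded)
    cost_per_call: float  # USD per call


def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.trust >= b.trust and a.cost_per_call <= b.cost_per_call
    strictly_better = a.trust > b.trust or a.cost_per_call < b.cost_per_call
    return no_worse and strictly_better


def pareto_frontier(rows):
    """Keep every row that no other row dominates."""
    return [r for r in rows if not any(dominates(other, r) for other in rows)]


rows = [
    ModelRow("Claude Fable 5", 94, 0.236),
    ModelRow("DeepSeek-V3", 94, 0.0027),
]

frontier = pareto_frontier(rows)
cost_ratio = rows[0].cost_per_call / rows[1].cost_per_call  # about 87x
```

With equal rounded scores, the cheaper model dominates, so only DeepSeek-V3 survives in this two-row example.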

Where NEXUS comes in

This exact tradeoff (a top-ranked, currently inaccessible, premium-priced model statistically tied with a dramatically cheaper, Pareto-efficient one) is the decision NEXUS is built to make automatically instead of by hand. NEXUS is the routing and calibration substrate AEQUARA’s tools are being built onto. Its Pareto Model Routing sends each query to the cheapest model that clears the quality bar for that task. Its Brier score tracking checks every predictive claim against what actually happened. Its LinUCB reward loop updates which model wins that decision as new outcomes come in, so the routing sharpens with use rather than freezing on whichever model held the top leaderboard spot the day someone last checked. NEXUS is in development (the API opens Q3 2026; a pilot cohort is open now), so treat this as the substrate’s design intent, not a claim that it is making this exact call in production today.
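The routing rule described here, "cheapest model that clears the quality bar", can be sketched as follows. This is not NEXUS code or its API: `Candidate`, `route`, the `available` flag, and the threshold values are all hypothetical, and the sketch leaves out the LinUCB reward loop and Brier score tracking entirely.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    trust: float          # measured Trust score
    cost_per_call: float  # USD per call
    available: bool       # False if the model is access-gated


def route(candidates: Sequence[Candidate], min_trust: float) -> Optional[Candidate]:
    """Return the cheapest available candidate at or above min_trust, else None."""
    eligible = [c for c in candidates if c.available and c.trust >= min_trust]
    return min(eligible, key=lambda c: c.cost_per_call, default=None)


candidates = [
    Candidate("Claude Fable 5", 94, 0.236, available=False),
    Candidate("DeepSeek-V3", 94, 0.0027, available=True),
]

choice = route(candidates, min_trust=90)       # hypothetical quality bar
no_choice = route(candidates, min_trust=95)    # a bar that nothing clears
```

Returning `None` when no model clears the bar is a design choice in this sketch: it makes the caller handle the failure explicitly rather than silently falling back to a lower-scoring model.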

The honest takeaway

Fable 5’s #1 finish is real, measured, and worth taking seriously — and it comes with three qualifications a headline strips out: it’s a statistical tie at the top, its access status means the result can’t currently be independently re-run, and it is not the cost-efficient choice for the same measured trust. All three qualifications already sit in the public record at the model index, re-derivable by anyone who wants to check our arithmetic rather than our press release. That’s the whole point of measuring calibration in public: the number and its asterisks travel together, or the number isn’t worth much.

