Surface 2 · live measurement

Which models we trust — and how much.

AEQUARA verdicts come from a panel of frontier models. Before we weight any model's vote, we measure what the public leaderboards ignore: calibration — whether a model's stated confidence matches how often it's actually right. This is the measured scoreboard behind the panel — a full calibration benchmark you can re-rank by any axis, not a single number.

Real, measured data — not a vibes-check. Every number is computed from confidence-scored predictions (knowledge MCQs + resolved real-world events) by trust_index.py over committed, auditable source data — reported with bootstrap 95% confidence intervals, not point estimates. This is the first AEQUARA Trust surface backed by genuine measurement, and it needed no proprietary or human-panel data. It delivers Constitution Commitment 01: calibrated, not theatrical.
Models measured: frontier LLMs, all providers
Predictions scored: knowledge + resolved events
Axes you can rank by: 9 × 2 (calibration · tail · cost… × 2 substrates)
Cross-domain transfer: r = +0.73, permutation p = 0.0005 (calibration is a trait)
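
For readers who want to see what a correlation with a permutation p-value involves, here is a minimal sketch. It assumes the reported r is a Pearson correlation between per-model scores on the two substrates, which the page does not state; `pearson_r` and `permutation_p` are illustrative names, not functions from trust_index.py.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def permutation_p(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value: how often shuffled pairings
    produce |r| at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # +1 correction so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-model scores on two substrates (made-up numbers)
knowledge = [1, 2, 3, 4, 5, 6, 7, 8]
forecasting = [2, 4, 6, 8, 10, 12, 14, 16]
r = pearson_r(knowledge, forecasting)
p = permutation_p(knowledge, forecasting)
```

The number of permutations and the seed here are arbitrary choices for the sketch.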

The calibration leaderboard

Ranked by Trust by default — or re-rank by any axis below (honest confidence, tail risk, discrimination, selective risk, cost…) and per individual domain. Click any row for its reliability diagrams and full metric suite.

Across releases · does each generation get more calibrated?

Same family, version over version. The gold-bordered card is the most honest release (forecasting slope closest to 1.0). The answer isn’t a clean yes — calibration shifts across releases rather than climbing monotonically, often trading forecasting honesty for knowledge accuracy. Bootstrap CIs overlap, so treat modest gaps as ties; the directional trade-offs are real and a capability leaderboard can’t see them.

How to read this index
Trust Score ± CI: Confidence reliability on a 0–100 scale, with a bootstrap 95% interval. Overlapping intervals mean two models are statistically indistinguishable, as most of the frontier is.
Re-rank by any axis: One ranking is a simplification. Sort by honest confidence (slope), tail risk, discrimination, selective risk, or cost; the winner changes with the question you're asking.
ECE / adaptive ECE: Expected Calibration Error, the average gap between stated confidence and observed frequency. Adaptive ECE uses equal-mass bins, which makes it robust to skewed confidence distributions. Lower is better.
Calibration slope: From a logistic fit. 1.0 = perfectly calibrated direction; <1 = over-confident; >1 = under-confident. Most of the frontier sits below 1 on forecasts, but not all.
Tail & selective risk: Tail ≥90% asks "when it claims near-certainty, how often is it right?" Selective risk (AURC) asks "if it only answered its most-confident questions, how good would it be?"
Reliability diagram: The curve in each row's detail. Points on the diagonal = perfectly calibrated; below = over-confident. The shaded gap is the miscalibration AEQUARA corrects for. The ★ marks the Trust-vs-cost efficient frontier.
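
The two ECE variants defined above can be sketched in a few lines. This is an illustration of the standard definitions, not the code in trust_index.py; the function names and the default of 10 bins are assumptions.

```python
def _weighted_gap(bins, n):
    """Sum over bins of (bin share) * |accuracy - mean confidence|."""
    total = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        total += len(b) / n * abs(accuracy - mean_conf)
    return total

def ece_equal_width(conf, correct, n_bins=10):
    """ECE with equal-width confidence bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 lands in the last bin
        bins[idx].append((c, y))
    return _weighted_gap(bins, len(conf))

def ece_adaptive(conf, correct, n_bins=10):
    """ECE with equal-mass bins: sort by confidence, split into near-equal counts."""
    pairs = sorted(zip(conf, correct), key=lambda p: p[0])
    n = len(pairs)
    bins = [pairs[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]
    return _weighted_gap(bins, n)

# A model that says 90% but is right half the time has a gap of 0.4
over_confident = ece_equal_width([0.9] * 10, [1] * 5 + [0] * 5)
```

Equal-mass bins matter when a model puts most of its answers at high confidence: equal-width bins would leave the low-confidence bins nearly empty and noisy.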

What this establishes

Findings a buyer of trust-scoring can act on — each grounded in the measured suite above.

Finding 01

One ranking is never enough

The best-calibrated model on average (Trust) is not the one you'd trust at near-certainty, nor the best per dollar. Re-rank by honest confidence and the Trust #1 can fall toward the bottom — which is exactly why a single leaderboard number is a simplification.

Finding 02

Most of the frontier is over-confident — but a calibrated tier is emerging

Calibration slope sits below 1.0 for most models on forecasting, yet the current flagships are catching up — Sonnet 4.6 (1.09) and Gemini Pro (0.99) are essentially honest, while cheap models stay sharply over-confident (GPT-4o mini 0.42). Rule: discount stated confidence most on the cheap models.
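
To make the slope figures concrete, here is a minimal sketch of one common way to obtain a calibration slope: regress correctness on the logit of stated confidence and read off the coefficient. The page says only that the slope comes "from a logistic fit", so the exact specification (intercept, clipping, optimizer) is an assumption, and `calibration_slope` is a hypothetical name.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def calibration_slope(conf, correct, iters=50):
    """Fit P(correct) = sigmoid(a + b * logit(conf)) by Newton's method.
    Returns (a, b); b is the calibration slope."""
    xs = [logit(c) for c in conf]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, correct):
            p = 1 / (1 + math.exp(-(a + b * x)))
            w = p * (1 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break  # degenerate design (e.g. a single confidence value)
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

# Made-up example: says 70% and 90%, is right 60% and 70% of the time
conf = [0.7] * 10 + [0.9] * 10
correct = [1] * 6 + [0] * 4 + [1] * 7 + [0] * 3
_, slope = calibration_slope(conf, correct)  # slope comes out well below 1
```

A slope below 1 means stated confidence rises faster than accuracy does, which is the over-confidence pattern this finding describes.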

Finding 03

Calibration is orthogonal to capability — and to cost

Accuracy barely separates the field; calibration spans a wide range, and AUROC tells yet another story. The Trust-vs-cost ★ frontier is mostly free and near-free models — you can buy 89–94 Trust for cents. Public leaderboards rank capability and are blind to all of this.

Finding 04

We can fix it — the recalibration payoff

A per-model isotonic mapping (5-fold cross-validated) measurably cuts each model's ECE. Open any row to see its before→after. That mapping is the bias-corrector the AEQUARA SDK applies in production.
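
As an illustration of what a cross-validated isotonic mapping does, the sketch below fits a non-decreasing step function with the pool-adjacent-violators algorithm and applies it out-of-fold. It is a generic sketch of the technique, not the AEQUARA SDK's corrector; all function names, the shuffling scheme, and the seed are assumptions.

```python
import bisect
import random
from collections import defaultdict

def fit_isotonic(conf, correct):
    """Pool-adjacent-violators fit. Returns (right edges, fitted values)."""
    agg = defaultdict(lambda: [0.0, 0])
    for x, y in zip(conf, correct):   # merge tied confidences first
        agg[x][0] += y
        agg[x][1] += 1
    blocks = []                        # each block: [sum_y, count, right_edge]
    for x in sorted(agg):
        s, n = agg[x]
        blocks.append([s, n, x])
        # pool while the previous block's mean exceeds the current one
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s2, n2, edge = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += n2
            blocks[-1][2] = edge
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]

def apply_isotonic(model, c):
    """Map a stated confidence to its recalibrated value (step lookup)."""
    edges, values = model
    i = bisect.bisect_left(edges, c)
    return values[min(i, len(values) - 1)]

def recalibrate_out_of_fold(conf, correct, k=5, seed=0):
    """Each prediction is corrected by a mapping fitted on the other folds."""
    idx = list(range(len(conf)))
    random.Random(seed).shuffle(idx)
    out = [None] * len(conf)
    for f in range(k):
        held = set(idx[f::k])
        train = [i for i in idx if i not in held]
        model = fit_isotonic([conf[i] for i in train],
                             [correct[i] for i in train])
        for i in held:
            out[i] = apply_isotonic(model, conf[i])
    return out

# Made-up model: always says 90%, is right 60% of the time
conf = [0.9] * 50
correct = [1] * 30 + [0] * 20
fixed = recalibrate_out_of_fold(conf, correct)
```

Scoring only out-of-fold predictions keeps the before→after ECE comparison from being flattered by a mapping that has already seen the answers it is correcting.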

Honest limits · binding per ETHICS §3

What v1 is — and isn't.

  • A benchmark, not the last word. Current-flagship models, two substrates, a full multi-axis suite with CIs. Next: generational version-vs-version comparison, sycophancy & abstention axes, more sizes and live questions.
  • Temporal-leakage caveat. The forecasting events resolved before the current models' training cutoffs, so a model may already have seen the outcomes. This is mitigated because the signal is in calibration, not accuracy, but a clean test needs post-cutoff questions.
  • This scores AI models, not humans. It validates AEQUARA's trust-scoring on the population we've fully measured. The human cross-domain trust layer is a separate, running experiment.
  • Two arms run under a different regime. Perplexity Sonar is web-grounded (it can retrieve), and Qwen3-Coder is a local code model on CPU. Both are flagged in the table; read them in that light, not head-to-head with the frontier. Selective-risk (AURC) is a confidence-threshold proxy, not true abstention.
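
The "confidence-threshold proxy" in the last limit can be made concrete. A minimal sketch of AURC under its usual definition follows: rank predictions by confidence, compute the error rate among the top k for every k, and average. The function name is hypothetical, and trust_index.py may handle ties or normalization differently.

```python
def aurc(conf, correct):
    """Area under the risk-coverage curve (lower is better).
    Risk at coverage k/n is the error rate among the k most confident answers."""
    order = sorted(range(len(conf)), key=lambda i: -conf[i])
    errors = 0
    total = 0.0
    for k, i in enumerate(order, start=1):
        errors += 1 - correct[i]
        total += errors / k
    return total / len(order)

# Made-up examples: confidence that tracks correctness vs. confidence that does not
good = aurc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
bad = aurc([0.9, 0.8, 0.7, 0.6], [0, 0, 1, 1])
```

Nothing here involves the model declining to answer: the questions are filtered after the fact by a confidence threshold, which is why the page calls it a proxy.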
Provenance · reproduce it yourself

Generated by ai-genome/trust_index.py over two committed elicitation files — calibration_pairs.csv (knowledge / MMLU, 5,724 pairs) and forecasting_pairs.csv (resolved events / Autocast, 7,447 pairs). Bootstrap CIs use a fixed seed (20260606, B=600). The machine-readable artifact behind this page is calibration-index.json; the full method, ranking taxonomy, cross-domain correlations, and human-crowd baseline are in the research write-up.
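
To show why a fixed seed makes the intervals repeatable, here is a minimal percentile-bootstrap sketch using the seed and B quoted above. The percentile method and the resampling unit (individual predictions) are assumptions; the page states only the seed, B, and the 95% level. `bootstrap_ci` is an illustrative name.

```python
import random

def bootstrap_ci(values, stat, B=600, seed=20260606, alpha=0.05):
    """Percentile bootstrap interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed gives the same resamples every run
    n = len(values)
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    lo = stats[int(alpha / 2 * B)]
    hi = stats[min(int((1 - alpha / 2) * B), B - 1)]
    return lo, hi

def mean(v):
    return sum(v) / len(v)

# Made-up outcomes: 70 correct out of 100
outcomes = [1] * 70 + [0] * 30
lo, hi = bootstrap_ci(outcomes, mean)
```

Calling `bootstrap_ci` twice with the same inputs returns the identical interval, which is the property a bit-for-bit reproducibility claim depends on.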

This leaderboard is built to be bit-for-bit reproducible. Major leaderboards cannot be recomputed from published source — Chatbot Arena (private vote logs), the HF Open LLM Leaderboard (closed eval queue), HELM (non-pinned harness). Ours can: same data + same seed → same SHA-256, exactly — verified end-to-end on our side, with the public repro kit (input CSVs + harness) staged for release. One command verifies it:

$ python verify_reproducibility.py
  [PASS] input integrity        calibration_pairs.csv
  [PASS] input integrity        forecasting_pairs.csv
  [PASS] output reproducibility trust_index.json
  VERDICT: [PASS] FULLY_REPRODUCED

pinned SHA-256 (v1.2, 2026-06-10) — calibration_pairs.csv 1704e7d7…1d95a3f · forecasting_pairs.csv 697e7f09…c518c03 · trust_index.json 63c1e842…991db39
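
An input-integrity check of this kind typically hashes each file and compares the result with the pinned digest. The sketch below shows that step with Python's standard hashlib; `sha256_of` and `matches_pinned` are hypothetical names, not the contents of verify_reproducibility.py. The digests shown on this page are truncated, so the full pinned values would have to come from the release itself.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large CSVs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def matches_pinned(path, pinned_hex):
    """True if the file's digest equals the full pinned digest (case-insensitive)."""
    return sha256_of(path) == pinned_hex.strip().lower()
```

Any single-byte change to an input, including a different line ending, produces a different digest, so a check like this fails loudly rather than drifting silently.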


This is the measured substrate behind the model panel you see on every verdict. Where the per-tool Track Record awaits the production verdict pipeline (and says so), the Model Calibration Index is real today — because measuring model calibration needs no user data and no external gate. When AEQUARA's confidence scores claim to be "calibrated, not theatrical," this is the proof.
