Which models we trust — and how much.
AEQUARA verdicts come from a panel of frontier models. Before we weight any model's vote, we measure what the public leaderboards ignore: calibration — whether a model's stated confidence matches how often it's actually right. This is the measured scoreboard behind the panel — a full calibration benchmark you can re-rank by any axis, not a single number.
Computed by trust_index.py over committed, auditable source data, and reported with bootstrap 95% confidence intervals rather than point estimates. This is the first AEQUARA Trust surface backed by genuine measurement, and it needed no proprietary or human-panel data. It delivers Constitution Commitment 01: calibrated, not theatrical.
The calibration leaderboard
Ranked by Trust by default — or re-rank by any axis below (honest confidence, tail risk, discrimination, selective risk, cost…) and per individual domain. Click any row for its reliability diagrams and full metric suite.
Same family, version over version. The gold-bordered card is the most honest release (forecasting slope closest to 1.0). Newer releases are not reliably better calibrated: calibration shifts across releases rather than climbing monotonically, often trading forecasting honesty for knowledge accuracy. Bootstrap CIs overlap, so treat modest gaps as ties; the directional trade-offs are real, and a capability leaderboard can't see them.
Honest confidence (slope), tail risk, discrimination, selective risk or cost: the winner changes with the question you're asking.
Slope: 1.0 = perfectly calibrated direction; <1 = over-confident; >1 = under-confident. Most of the frontier sits below 1 on forecasts, but not all.
Tail ≥90% asks "when it claims near-certainty, how often is it right?" Selective risk (AURC) asks "if it only answered its most-confident questions, how good would it be?"
What this establishes
Findings a buyer of trust-scoring can act on — each grounded in the measured suite above.
One ranking is never enough
The best-calibrated model on average (Trust) is not the one you'd trust at near-certainty, nor the best per dollar. Re-rank by honest confidence and the Trust #1 can fall toward the bottom — which is exactly why a single leaderboard number is a simplification.
Most of the frontier is over-confident — but a calibrated tier is emerging
Calibration slope sits below 1.0 for most models on forecasting, yet the current flagships are catching up — Sonnet 4.6 (1.09) and Gemini Pro (0.99) are essentially honest, while cheap models stay sharply over-confident (GPT-4o mini 0.42). Rule: discount stated confidence most on the cheap models.
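To make the slope axis concrete, here is a minimal sketch of one common definition of calibration slope: the coefficient b in a logistic fit of outcome on logit(confidence). Whether trust_index.py uses exactly this definition is an assumption; the function names and the toy data are hypothetical and not drawn from the benchmark.

```python
import math

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # clamp so confidences of 0 or 1 stay finite
    return math.log(p / (1 - p))

def calibration_slope(confidences, outcomes, n_iter=50):
    """Fit outcome ~ sigmoid(a + b * logit(confidence)) by Newton's method; return b."""
    xs = [logit(c) for c in confidences]
    a, b = 0.0, 1.0  # start from the "perfectly calibrated" line
    for _ in range(n_iter):
        ps = [1 / (1 + math.exp(-(a + b * x))) for x in xs]
        # Gradient of the log-likelihood with respect to (a, b).
        g_a = sum(y - p for y, p in zip(outcomes, ps))
        g_b = sum((y - p) * x for y, p, x in zip(outcomes, ps, xs))
        # Entries of the negated Hessian (a 2x2 symmetric matrix).
        ws = [p * (1 - p) for p in ps]
        h_aa = sum(ws)
        h_ab = sum(w * x for w, x in zip(ws, xs))
        h_bb = sum(w * x * x for w, x in zip(ws, xs))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return b

# Toy data (made up): says 90% but is right 70% of the time; says 50% and is right 50%.
toy_confidences = [0.9] * 10 + [0.5] * 10
toy_outcomes = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 5
toy_slope = calibration_slope(toy_confidences, toy_outcomes)
```

On this toy data the fitted slope is about 0.39, well below 1, because the stated 90% is only right 70% of the time. Under this definition, a slope below 1 means stated confidences are more extreme than the outcomes justify.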
Calibration is orthogonal to capability — and to cost
Accuracy barely separates the field; calibration spans a wide range, and AUROC tells yet another story. The Trust-vs-cost ★ frontier is mostly free and near-free models — you can buy 89–94 Trust for cents. Public leaderboards rank capability and are blind to all of this.
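A small sketch of why these axes can disagree: three hypothetical models with identical accuracy but different calibration error (ECE) and discrimination (AUROC). The equal-width binning, the function names, and the toy data are assumptions for illustration, not the benchmark's own implementation.

```python
def accuracy(outcomes):
    return sum(outcomes) / len(outcomes)

def ece(confidences, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, y in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the top bin
        bins[idx].append((conf, y))
    total = len(confidences)
    score = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            bin_acc = sum(y for _, y in members) / len(members)
            score += len(members) / total * abs(bin_acc - mean_conf)
    return score

def auroc(confidences, outcomes):
    """Chance that a correct answer carries higher confidence than a wrong one (ties count half)."""
    right = [c for c, y in zip(confidences, outcomes) if y == 1]
    wrong = [c for c, y in zip(confidences, outcomes) if y == 0]
    wins = sum((r > w) + 0.5 * (r == w) for r in right for w in wrong)
    return wins / (len(right) * len(wrong))

# Made-up outcomes shared by all three toy models: 6 right, 2 wrong.
outcomes = [1, 1, 1, 1, 1, 1, 0, 0]
toy_models = {
    "flat_honest": [0.75] * 8,                # always says 75%
    "flat_overconfident": [0.95] * 8,         # always says 95%
    "sharp": [0.95] * 6 + [0.55] * 2,         # lower confidence exactly where it is wrong
}
scores = {
    name: (accuracy(outcomes), ece(confs, outcomes), auroc(confs, outcomes))
    for name, confs in toy_models.items()
}
```

All three score 0.75 on accuracy. The ECE values are roughly 0, 0.2 and 0.175, and the AUROC values are 0.5, 0.5 and 1.0, so a ranking on any one axis would order them differently.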
We can fix it — the recalibration payoff
A per-model isotonic mapping (5-fold cross-validated) measurably cuts each model's ECE. Open any row to see its before→after. That mapping is the bias-corrector the AEQUARA SDK applies in production.
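The recalibration idea can be sketched as follows: fit a monotone (isotonic) map from stated confidence to observed hit rate on four folds, apply it to the held-out fold, and compare ECE before and after. This is a stdlib-only illustration using pool-adjacent-violators; the function names, fold assignment, and synthetic data are assumptions, and it is not the AEQUARA SDK's corrector.

```python
import random

def fit_isotonic(confidences, outcomes):
    """Pool-adjacent-violators fit; returns [(upper_confidence, calibrated_value), ...]."""
    totals = {}
    for conf, y in zip(confidences, outcomes):  # pool identical confidences first
        s, n = totals.get(conf, (0.0, 0))
        totals[conf] = (s + y, n + 1)
    blocks = []  # each block is [sum_of_outcomes, count, upper_confidence]
    for conf in sorted(totals):
        s, n = totals[conf]
        blocks.append([s, n, conf])
        # Merge backwards while the fitted values would decrease.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s2, n2, hi2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += n2
            blocks[-1][2] = hi2
    return [(hi, s / n) for s, n, hi in blocks]

def apply_isotonic(mapping, conf):
    for hi, value in mapping:  # step function: first block that covers conf
        if conf <= hi:
            return value
    return mapping[-1][1]

def ece(confidences, outcomes, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, y in zip(confidences, outcomes):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, y))
    return sum(
        len(b) / len(confidences)
        * abs(sum(y for _, y in b) / len(b) - sum(c for c, _ in b) / len(b))
        for b in bins if b
    )

def cross_validated_recalibration(confidences, outcomes, n_folds=5):
    recalibrated = [0.0] * len(confidences)
    for fold in range(n_folds):
        train = [i for i in range(len(confidences)) if i % n_folds != fold]
        mapping = fit_isotonic([confidences[i] for i in train], [outcomes[i] for i in train])
        for i in range(fold, len(confidences), n_folds):  # held-out items only
            recalibrated[i] = apply_isotonic(mapping, confidences[i])
    return recalibrated

# Synthetic over-confident model: true hit rate is 0.2 below the stated confidence.
rng = random.Random(0)
confs = [rng.choice([0.6, 0.7, 0.8, 0.9, 0.95]) for _ in range(2000)]
hits = [1 if rng.random() < c - 0.2 else 0 for c in confs]
recalibrated = cross_validated_recalibration(confs, hits)
ece_before, ece_after = ece(confs, hits), ece(recalibrated, hits)
```

Because every prediction is scored by a map fitted without that item, the before and after ECE comparison is not inflated by fitting and scoring on the same data.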
What v1 is — and isn't.
- A benchmark, not the last word. Current-flagship models, two substrates, a full multi-axis suite with CIs. Next: generational version-vs-version comparison, sycophancy & abstention axes, more sizes and live questions.
- Temporal-leakage caveat. The forecasting events resolved before the current models' training cutoffs, so a model may already have seen the outcomes. The risk is mitigated because the signal is in calibration, not accuracy, but a clean test needs post-cutoff questions.
- This scores AI models, not humans. It validates AEQUARA's trust-scoring on the population we've fully measured. The human cross-domain trust layer is a separate, running experiment.
- Two arms run under a different regime. Perplexity Sonar is web-grounded (it can retrieve), and Qwen3-Coder is a local code model on CPU. Both are flagged in the table; read them in that light, not head-to-head with the frontier. Selective-risk (AURC) is a confidence-threshold proxy, not true abstention.
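For readers unfamiliar with the two confidence-threshold axes, a minimal sketch, assuming the textbook definitions: tail accuracy is the hit rate among answers stated at or above a threshold, and AURC is the average error rate as coverage grows from the most-confident answer to all answers. Function names and data are hypothetical.

```python
def tail_accuracy(confidences, outcomes, threshold=0.9):
    """Hit rate among answers whose stated confidence is at or above the threshold."""
    tail = [y for c, y in zip(confidences, outcomes) if c >= threshold]
    return sum(tail) / len(tail) if tail else None

def aurc(confidences, outcomes):
    """Area under the risk-coverage curve: mean error rate over the top-k most confident answers."""
    ranked = sorted(zip(confidences, outcomes), key=lambda pair: pair[0], reverse=True)
    errors = 0
    risks = []
    for k, (_, y) in enumerate(ranked, start=1):
        errors += 1 - y            # running count of wrong answers among the top k
        risks.append(errors / k)   # selective risk at coverage k / n
    return sum(risks) / len(risks)

# Made-up example: four answers, two of them claimed at near-certainty.
toy_confidences = [0.95, 0.9, 0.8, 0.6]
toy_outcomes = [1, 0, 1, 0]
toy_tail = tail_accuracy(toy_confidences, toy_outcomes)
toy_aurc = aurc(toy_confidences, toy_outcomes)
```

Here the tail accuracy is 0.5 (one of the two near-certain answers is wrong) and the AURC is about 0.33; lower AURC is better. Nothing in this calculation lets the model decline to answer, which is why it is a proxy rather than a measure of true abstention.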
Generated by ai-genome/trust_index.py over two committed elicitation files — calibration_pairs.csv (knowledge / MMLU, 5,724 pairs) and forecasting_pairs.csv (resolved events / Autocast, 7,447 pairs). Bootstrap CIs use a fixed seed (20260606, B=600). The machine-readable artifact behind this page is calibration-index.json; the full method, ranking taxonomy, cross-domain correlations, and human-crowd baseline are in the research write-up.
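As an illustration of what a fixed-seed bootstrap interval involves, here is a minimal percentile-bootstrap sketch using the seed and resample count quoted above. The percentile method, the function names, and the toy data are assumptions; trust_index.py may construct its intervals differently.

```python
import random

def bootstrap_ci(values, statistic, n_resamples=600, seed=20260606, level=0.95):
    """Percentile bootstrap interval for statistic(values), deterministic for a given seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        resample = [values[rng.randrange(n)] for _ in range(n)]  # draw n items with replacement
        estimates.append(statistic(resample))
    estimates.sort()
    cut = int(n_resamples * (1 - level) / 2)  # number of estimates trimmed from each tail
    return estimates[cut], estimates[n_resamples - 1 - cut]

def mean(xs):
    return sum(xs) / len(xs)

# Made-up data: 70 correct answers out of 100.
toy_hits = [1] * 70 + [0] * 30
low, high = bootstrap_ci(toy_hits, mean)
```

Calling bootstrap_ci again with the same inputs returns the same interval, which is the property a pinned seed buys. Exact agreement across different Python versions is not guaranteed for every random method, so a bit-for-bit claim also depends on a pinned interpreter.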
This leaderboard is built to be bit-for-bit reproducible. Major leaderboards cannot be recomputed from published source — Chatbot Arena (private vote logs), the HF Open LLM Leaderboard (closed eval queue), HELM (non-pinned harness). Ours can: same data + same seed → same SHA-256, exactly — verified end-to-end on our side, with the public repro kit (input CSVs + harness) staged for release. One command verifies it:
$ python verify_reproducibility.py
[PASS] input integrity calibration_pairs.csv
[PASS] input integrity forecasting_pairs.csv
[PASS] output reproducibility trust_index.json
VERDICT: [PASS] FULLY_REPRODUCED
pinned SHA-256 (v1.2, 2026-06-10) — calibration_pairs.csv 1704e7d7…1d95a3f · forecasting_pairs.csv 697e7f09…c518c03 · trust_index.json 63c1e842…991db39
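A checksum comparison of this kind can be sketched with the standard library alone. This is not the contents of verify_reproducibility.py; the helper names are hypothetical, and the demo hashes a throwaway file rather than the real inputs. Pins shown in abbreviated "prefix…suffix" form can only be checked on their visible characters, so a full digest is the stronger test.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Stream a file through SHA-256 so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_pin(actual_hex, pinned):
    """Compare against a full digest, or against an abbreviated 'prefix…suffix' pin."""
    if "…" in pinned:
        prefix, suffix = pinned.split("…", 1)
        return actual_hex.startswith(prefix) and actual_hex.endswith(suffix)
    return actual_hex == pinned

# Demo on a temporary file containing the bytes b"abc" (not one of the pinned inputs).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"abc")
    demo_path = handle.name
demo_digest = sha256_of_file(demo_path)
os.remove(demo_path)
demo_ok = matches_pin(demo_digest, "ba7816bf…f20015ad")
```

In use, each input CSV and the output JSON would be hashed this way and compared with the pinned values for the matching release.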
This is the measured substrate behind the model panel you see on every verdict. Where the per-tool Track Record awaits the production verdict pipeline (and says so), the Model Calibration Index is real today — because measuring model calibration needs no user data and no external gate. When AEQUARA's confidence scores claim to be "calibrated, not theatrical," this is the proof.