Surface 2 · public scoreboard

⚠ Prototype — scores below are mock data. Q3 2026 is the first real publication, computed from production verdict logs. The structure of this page is the binding commitment per Constitution §9 — the numbers will be real.

When we're right, when we're wrong.

Per-tool Brier calibration scores, updated quarterly. Lower is better. Per Constitution §9: if any tool's quarterly Brier score degrades (rises) by more than 0.05, we pause the tool, publish the degradation here, notify every affected user, and refund the quarter.
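
The pause rule can be sketched in a few lines. This is a minimal illustration, not the production check: the name `should_pause` and the rounding step are assumptions, and it relies on lower Brier scores being better, so degradation is an increase from one quarter to the next.

```python
PAUSE_THRESHOLD = 0.05  # quarterly degradation limit from Constitution §9


def should_pause(previous_brier: float, current_brier: float,
                 threshold: float = PAUSE_THRESHOLD) -> bool:
    """Return True when the Brier score rose by more than `threshold`.

    Hypothetical helper. A positive drift means the score got worse.
    Rounding to 4 decimals (an assumption) avoids floating-point noise
    near the threshold.
    """
    drift = round(current_brier - previous_brier, 4)
    return drift > threshold
```

For instance, a tool moving from 0.24 to 0.30 would trip the rule, while a move from 0.16 to 0.19 would not, and any improvement never triggers a pause.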

Portfolio Brier (Q1 2026)

0.18 mock

target <0.25 · "well-calibrated"

Tools tracked

10 live

+ 47 in pipeline

Verdicts this quarter

— Q2 onward

prod pipeline coming online

Tools paused for drift

0 currently

auto-pause trigger: >0.05 drift

Per-tool calibration

Mock data · clearly labeled. Real Q2 numbers ship when the verdict pipeline reaches production.

Tool	Domain · audience	Brier (Q1)	Band	Trend vs Q4	Verdicts	Real data
Severance Analyzer	employment · for the laid-off worker	0.14	well-calibrated	↓ 0.02 (better)	~1,400 mock	Publishes Q3 2026
Contract Surgeon	contract review · for the receiving party	0.17	well-calibrated	± 0.00	~890 mock	Publishes Q3 2026
IRS Shield	tax positions · for the taxpayer	0.22	well-calibrated	↓ 0.04 (better)	~620 mock	Publishes Q3 2026
Medical Bill Defender	billing analysis · for the patient	0.19	well-calibrated	↑ 0.03 (worse)	~780 mock	Publishes Q3 2026
Lease Analyzer	tenant rights · for the tenant	0.20	well-calibrated	↓ 0.01 (better)	~1,100 mock	Publishes Q3 2026
Divorce Decoder	process navigation · for the asking spouse	0.28	directional	↓ 0.05 (better)	~320 mock	Publishes Q3 2026
Offer Letter Analyzer	comp review · for the candidate	0.13	well-calibrated	± 0.00	~2,100 mock	Publishes Q3 2026
Insurance Claim Coach	claim filing · for the claimant	0.24	well-calibrated	↑ 0.02 (worse)	~540 mock	Publishes Q3 2026
Demand Letter Pro	letter drafting · for the sender	0.30	directional	↑ 0.06 (worse; auto-pause triggered)	~210 mock	Publishes Q3 2026
KAIROS	learning hubs · for the learner	0.11	well-calibrated	↓ 0.03 (better)	~3,400 mock	Publishes Q3 2026

How to read these scores

Brier below 0.25: well-calibrated. The predicted confidence matches the actual outcome rate. You can rely on the verdict as given; override is always available.

Brier 0.25 to 0.40: directional. Treat the verdict as a tentative input, not a decision. It will be labeled NEEDS HUMAN REVIEW.

Brier above 0.40: miscalibrated. The tool is auto-paused. We do not ship verdicts users cannot trust. The tool must pass retraining before it is deployed again.
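
To make the bands concrete, here is a minimal sketch of how a score could be computed and mapped to a band. It uses the standard binary Brier score (mean squared difference between the predicted confidence and the 0/1 outcome). The authoritative method is the one documented at /trust-v2/methodology and may differ. The function names are hypothetical, and treating a score of exactly 0.25 as directional (in line with the "<0.25" target) is an assumption.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted confidence and actual outcome.

    `outcomes` holds 1 when the verdict proved correct and 0 otherwise.
    """
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    squared_errors = [(p - o) ** 2 for p, o in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def band(brier: float) -> str:
    """Map a Brier score to the reading bands described above."""
    if brier < 0.25:
        return "well-calibrated"
    if brier <= 0.40:
        return "directional"   # verdicts labeled NEEDS HUMAN REVIEW
    return "miscalibrated"     # tool is auto-paused
```

A single confident miss is costly under this definition: a 0.84-confidence verdict that turns out wrong contributes (0.84 − 0)² ≈ 0.71 to the average, which is why a handful of such misses can move a low-volume tool across a band boundary.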

Known misses · Q1 2026 preview

Public-when-wrong is binding. Each sustained appeal publishes here (PII redacted unless user opts in).

Severance Analyzer · 2026-04-12 · verdict: SAFE TO SEND (0.84) · actual: rejected by employer

Mock miss — counter-offer overestimated by 18%

Model panel converged on $48K severance as median; employer countered at $39K and held. Root cause: Levels.fyi cell had only 47 records for the user's role+region+tenure band (below 100-record threshold) but was still weighted at T3. Mitigation: tightened minimum-record threshold to 150 for low-volume bins; user refunded; flagged for adversarial-pass strengthening.

Medical Bill Defender · 2026-03-28 · verdict: NEEDS HUMAN REVIEW (0.61) · actual: bill upheld

Mock miss — CPT code interpretation conflict

Model flagged CPT 99213 as upcoded from 99212; hospital provided documentation supporting the 99213 level. Root cause: substrate did not weight the hospital's documentation-pattern record (T2 regulatory source); panel under-routed to medical-coding specialist model. Mitigation: added documentation-pattern as required T2 input for E&M code disputes; user refunded.

Breakdowns by axis

By language · Publishes Q3 2026

By jurisdiction · Publishes Q3 2026

By counterparty class · Publishes Q3 2026

Demographic disparate-impact audit · Publishes Q3 2026

Per-tool model-panel composition →

8-quarter Brier time-series · Publishes Q3 2026

The Brier methodology behind these scores is documented at /trust-v2/methodology. The binding commitment to publish them is Constitution §9. The appeal process for when we're wrong is at /trust-v2/appeal.
