Surface 2 · public scoreboard

When we're right, when we're wrong.

Per-tool Brier calibration scores, updated quarterly (lower is better). Per Constitution §9: if any tool's quarterly Brier score degrades (rises) by more than 0.05, we pause the tool, publish the degradation here, notify every affected user, and refund the quarter.
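As a rough illustration of the rule above, and not the methodology documented at /trust-v2/methodology, the sketch below uses the standard Brier definition (the mean squared difference between a stated confidence and the 0/1 outcome) and a simple quarter-over-quarter comparison. The function names, the `PAUSE_DRIFT` constant, and the sample verdicts are hypothetical.

```python
from typing import Sequence

# "More than 0.05" comes from the page; the constant name is hypothetical.
PAUSE_DRIFT = 0.05


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("need equal-length, non-empty inputs")
    total = sum((p - o) ** 2 for p, o in zip(confidences, outcomes))
    return total / len(confidences)


def should_pause(previous_brier: float, current_brier: float,
                 threshold: float = PAUSE_DRIFT) -> bool:
    """True when the score rose (got worse) by more than the threshold."""
    return (current_brier - previous_brier) > threshold


# Made-up verdict logs for two consecutive quarters of one tool.
previous_quarter = brier_score([0.9, 0.7, 0.2], [1, 1, 0])
current_quarter = brier_score([0.9, 0.6, 0.6], [1, 0, 0])
print(should_pause(previous_quarter, current_quarter))
```

An improvement (a falling score) never triggers the pause in this sketch, because only a positive change is compared against the threshold.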

Vibes-check substrate. The Brier scores below are mock, illustrative data for prototype review. Q3 2026 will be the first real publication, computed from production verdict logs once the verdict pipeline is live. The structure of this page is the binding commitment under Constitution §9; from that publication on, the numbers will be real.

Portfolio Brier (Q1 2026): 0.18 mock · target <0.25 · "well-calibrated"
Tools tracked: 10 live · + 47 in pipeline
Verdicts this quarter: Q2 onward · prod pipeline coming online
Tools paused for drift: 0 currently · auto-pause trigger: >0.05 drift

Per-tool calibration

Mock data · clearly labeled. Real Q2 numbers ship when the verdict pipeline reaches production.

Tool | Brier (Q1) | Band | Trend vs Q4 | Verdicts
Severance Analyzer (employment · for the laid-off worker) | 0.14 | well-calibrated | ↓ 0.02 (better) | ~1,400 mock · Publishes Q3 2026
Contract Surgeon (contract review · for the receiving party) | 0.17 | well-calibrated | ± 0.00 | ~890 mock · Publishes Q3 2026
IRS Shield (tax positions · for the taxpayer) | 0.22 | well-calibrated | ↓ 0.04 (better) | ~620 mock · Publishes Q3 2026
Medical Bill Defender (billing analysis · for the patient) | 0.19 | well-calibrated | ↑ 0.03 (worse) | ~780 mock · Publishes Q3 2026
Lease Analyzer (tenant rights · for the tenant) | 0.20 | well-calibrated | ↓ 0.01 (better) | ~1,100 mock · Publishes Q3 2026
Divorce Decoder (process navigation · for the asking spouse) | 0.28 | directional | ↓ 0.05 (better) | ~320 mock · Publishes Q3 2026
Offer Letter Analyzer (comp review · for the candidate) | 0.13 | well-calibrated | ± 0.00 | ~2,100 mock · Publishes Q3 2026
Insurance Claim Coach (claim filing · for the claimant) | 0.24 | well-calibrated | ↑ 0.02 (worse) | ~540 mock · Publishes Q3 2026
Demand Letter Pro (letter drafting · for the sender) | 0.30 | directional | ↑ 0.06 (worse; auto-pause triggered) | ~210 mock · Publishes Q3 2026
KAIROS (learning hubs · for the learner) | 0.11 | well-calibrated | ↓ 0.03 (better) | ~3,400 mock · Publishes Q3 2026

How to read these scores

Brier below 0.25: well-calibrated. Predicted confidence tracks the actual outcome rate. You can act on the verdict directly (override is always available).
Brier 0.25–0.40: directional. Treat the verdict as a tentative input, not a decision. It will be labeled NEEDS HUMAN REVIEW.
Brier > 0.40: miscalibrated. The tool is auto-paused. We do not ship verdicts users cannot trust. Retraining is required before the next deployment.
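The three bands above can be sketched as a small lookup. This is an illustrative reading, not the production logic: the function name is hypothetical, and the sketch assumes that a score of exactly 0.25 counts as directional (consistent with the "target <0.25" note on this page) and that exactly 0.40 is still directional (since only scores above 0.40 are listed as miscalibrated).

```python
def calibration_band(brier: float) -> str:
    """Map a Brier score to the band names used on this page."""
    if not 0.0 <= brier <= 1.0:
        raise ValueError("a Brier score for binary outcomes lies in [0, 1]")
    if brier < 0.25:
        return "well-calibrated"
    if brier <= 0.40:
        # Assumed boundary handling: 0.25 and 0.40 both land here.
        return "directional"
    return "miscalibrated"


# Mock scores taken from the table on this page.
for tool, score in [("Insurance Claim Coach", 0.24), ("Divorce Decoder", 0.28)]:
    print(tool, calibration_band(score))
```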

Known misses · Q1 2026 preview

Public-when-wrong is binding. Each sustained appeal is published here (PII redacted unless the user opts in).

Severance Analyzer · 2026-04-12 · verdict: SAFE TO SEND (0.84) · actual: rejected by employer

Mock miss: counter-offer overestimated by $9K (about 19% of the predicted $48K)

Model panel converged on $48K severance as median; employer countered at $39K and held. Root cause: Levels.fyi cell had only 47 records for the user's role+region+tenure band (below 100-record threshold) but was still weighted at T3. Mitigation: tightened minimum-record threshold to 150 for low-volume bins; user refunded; flagged for adversarial-pass strengthening.
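One way to picture the mitigation above is a hard floor on record count, applied before a source cell contributes any weight. This is a minimal sketch under assumptions: the function name, the placeholder weight, and the choice to give an under-populated cell zero weight (rather than a reduced one) are hypothetical. Only the 47, 100, and 150 figures come from the write-up.

```python
def effective_weight(tier_weight: float, record_count: int,
                     min_records: int = 150) -> float:
    """Return the cell's weight, or 0.0 when it has too few records."""
    if record_count < 0 or min_records < 0:
        raise ValueError("counts must be non-negative")
    return tier_weight if record_count >= min_records else 0.0


# The cell in the mock miss had 47 records; 1.0 is a placeholder weight.
print(effective_weight(1.0, 47, min_records=100))  # earlier 100-record floor
print(effective_weight(1.0, 47))                   # tightened 150-record floor
```

Under either floor the 47-record cell contributes nothing in this sketch, which is the behavior the root-cause note says was missing.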

Medical Bill Defender · 2026-03-28 · verdict: NEEDS HUMAN REVIEW (0.61) · actual: bill upheld

Mock miss — CPT code interpretation conflict

Model flagged CPT 99213 as upcoded from 99212; hospital provided documentation supporting the 99213 level. Root cause: substrate did not weight the hospital's documentation-pattern record (T2 regulatory source); panel under-routed to medical-coding specialist model. Mitigation: added documentation-pattern as required T2 input for E&M code disputes; user refunded.

Breakdowns by axis

By language · Publishes Q3 2026
By jurisdiction · Publishes Q3 2026
By counterparty class · Publishes Q3 2026
Demographic disparate-impact audit · Publishes Q3 2026
Per-tool model-panel composition
8-quarter Brier time-series · Publishes Q3 2026

The Brier methodology behind these scores is documented at /trust-v2/methodology. The binding commitment to publish them is Constitution §9. The appeal process for when we're wrong is at /trust-v2/appeal.
