# Model-Risk Calibration Attestation

> **NEXUS-BY-BRIER** · AEQUARA verified-judgment infrastructure
> Vendor-neutral · tamper-evident · independently re-verifiable

- **Subject of attestation:** ILLUSTRATIVE SAMPLE - synthetic data, no real model
- **Attestation ID:** `ATT-8556DEB4390E`
- **Generated (UTC):** 2026-06-09T21:08:58Z
- **Decisions attested:** 180 resolved (of 180 read; 0 unresolved, 0 invalid-probability excluded)
- **Models covered:** 3
- **Evidence ledger:** `out/sample.evidence.jsonl`
- **Merkle tip (commits to all rows):** `50a5a296c397eaf45e92de86a2da85800e7dba751a10667a37496f6a35e265f4`

## 1. Calibration summary

| Model | N | Base rate | Mean predicted | Brier ↓ | Brier skill ↑ | ECE ↓ | MCE ↓ | AUC ↑ | Verdict |
|---|--:|--:|--:|--:|--:|--:|--:|--:|---|
| model-aurora-v2 | 60 | 0.6667 | 0.7694 | 0.2089 | 0.0598 | 0.1279 | 0.3826 | 0.7087 | marginally calibrated |
| model-orion-lite | 60 | 0.7000 | 0.6000 | 0.2012 | 0.0417 | 0.1400 | 0.2023 | 0.7097 | marginally calibrated |
| model-vega-fast | 60 | 0.7000 | 0.8844 | 0.2391 | -0.1384 | 0.1844 | 0.2950 | 0.6019 | poorly calibrated |

*Lower Brier, ECE, and MCE indicate better calibration. A Brier skill score above 0 means the model beats a constant base-rate forecaster; below 0, it does worse. AUC is the probability that the model scores a randomly chosen positive case above a randomly chosen negative one (0.5 = no discrimination).*
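
The Brier skill column can be cross-checked from the Brier and base-rate columns. This is a minimal sketch, assuming the skill score uses the standard reference of a constant base-rate forecast, whose Brier score is `base_rate * (1 - base_rate)`. The function name `brier_skill` is illustrative and not part of the attestation tool.

```python
def brier_skill(brier: float, base_rate: float) -> float:
    """Skill relative to always forecasting the base rate (assumed reference)."""
    reference = base_rate * (1.0 - base_rate)  # Brier score of the constant forecast
    return 1.0 - brier / reference

# (Brier, base rate, reported skill) copied from the summary table.
rows = {
    "model-aurora-v2": (0.2089, 0.6667, 0.0598),
    "model-orion-lite": (0.2012, 0.7000, 0.0417),
    "model-vega-fast": (0.2391, 0.7000, -0.1384),
}

for name, (brier, base_rate, reported) in rows.items():
    recomputed = brier_skill(brier, base_rate)
    # Inputs are rounded to 4 decimals, so agreement is only expected to ~3 decimals.
    print(f"{name}: recomputed {recomputed:+.4f}, reported {reported:+.4f}")
```

Differences in the fourth decimal place come from recomputing with the rounded table values rather than the underlying rows.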

## 2. Per-model reliability detail

### model-aurora-v2

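- Calibration: **marginally calibrated**; over-confident overall (predicted probabilities exceed observed outcomes on average). The Gap column is an absolute value; in the [0.80, 0.90] bin the observed rate is above the mean prediction.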

| Predicted-prob bin | N | Mean predicted | Observed rate | Gap |
|---|--:|--:|--:|--:|
| [0.50, 0.60] | 5 | 0.5826 | 0.2000 | 0.3826 |
| [0.60, 0.70] | 14 | 0.6512 | 0.5714 | 0.0798 |
| [0.70, 0.80] | 17 | 0.7526 | 0.6471 | 0.1055 |
| [0.80, 0.90] | 11 | 0.8404 | 0.9091 | 0.0687 |
| [0.90, 1.00] | 13 | 0.9303 | 0.7692 | 0.1611 |

### model-orion-lite

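- Calibration: **marginally calibrated**; under-confident overall (predicted probabilities trail observed outcomes on average). The Gap column is an absolute value; in the [0.40, 0.50] bin the mean prediction is above the observed rate.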

| Predicted-prob bin | N | Mean predicted | Observed rate | Gap |
|---|--:|--:|--:|--:|
| [0.40, 0.50] | 13 | 0.4771 | 0.3846 | 0.0925 |
| [0.50, 0.60] | 19 | 0.5346 | 0.7368 | 0.2023 |
| [0.60, 0.70] | 11 | 0.6504 | 0.7273 | 0.0769 |
| [0.70, 0.80] | 17 | 0.7347 | 0.8824 | 0.1476 |

### model-vega-fast

- Calibration: **poorly calibrated**; over-confident (predicted probabilities exceed observed outcomes).

| Predicted-prob bin | N | Mean predicted | Observed rate | Gap |
|---|--:|--:|--:|--:|
| [0.70, 0.80] | 6 | 0.7950 | 0.5000 | 0.2950 |
| [0.80, 0.90] | 28 | 0.8541 | 0.6429 | 0.2112 |
| [0.90, 1.00] | 26 | 0.9377 | 0.8077 | 0.1300 |
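
The ECE, MCE, mean predicted probability, and base rate reported in Section 1 can be recovered from the bin tables in this section. The sketch below does so for all three models, treating each gap as the absolute difference between mean predicted and observed rate, which is what the tabulated values are consistent with. The helper name `summarize` and the direction rule (overall mean prediction compared with base rate) are assumptions for illustration, not the tool's documented logic.

```python
# Each bin: (N, mean predicted, observed rate), copied from the tables in this section.
BINS = {
    "model-aurora-v2": [(5, 0.5826, 0.2000), (14, 0.6512, 0.5714), (17, 0.7526, 0.6471),
                        (11, 0.8404, 0.9091), (13, 0.9303, 0.7692)],
    "model-orion-lite": [(13, 0.4771, 0.3846), (19, 0.5346, 0.7368),
                         (11, 0.6504, 0.7273), (17, 0.7347, 0.8824)],
    "model-vega-fast": [(6, 0.7950, 0.5000), (28, 0.8541, 0.6429), (26, 0.9377, 0.8077)],
}

def summarize(bins):
    """Aggregate per-bin rows into the summary-table quantities."""
    total = sum(n for n, _, _ in bins)
    gaps = [abs(pred - obs) for _, pred, obs in bins]
    mean_pred = sum(n * pred for n, pred, _ in bins) / total
    base_rate = sum(n * obs for n, _, obs in bins) / total
    return {
        "n": total,
        "ece": sum(n * gap for (n, _, _), gap in zip(bins, gaps)) / total,  # N-weighted
        "mce": max(gaps),                                                   # worst bin
        "mean_pred": mean_pred,
        "base_rate": base_rate,
        # Assumed rule: over-confident when predictions run above outcomes on average.
        "direction": "over-confident" if mean_pred > base_rate else "under-confident",
    }

for name, bins in BINS.items():
    s = summarize(bins)
    print(f"{name}: N={s['n']} ECE={s['ece']:.4f} MCE={s['mce']:.4f} "
          f"mean_pred={s['mean_pred']:.4f} base_rate={s['base_rate']:.4f} {s['direction']}")
```

Because the bin rows are rounded to four decimals, recomputed values can differ from the summary table in the last digit.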

## 3. Regulatory cover memo

This attestation produces evidence that maps to the documentation and validation expectations of the following model-risk frameworks. Applicability is indicative; confirm with counsel (see disclaimer).

- **SR 26-2, Revised Interagency Guidance on Model Risk Management (Fed/OCC/FDIC, eff. 2026-04-17; supersedes SR 11-7 & SR 21-8):** SR 26-2 carries forward the SR 11-7 core principle that model validation includes 'outcomes analysis' and 'ongoing monitoring', that is, documented comparison of model output against realized outcomes, under a risk-based approach (most relevant to organizations with >$30B in assets). The per-model Brier score, reliability table, and observed-vs-predicted gap in Sections 1 and 2 are one rigorous form of that performance evidence, computed over the supplied decision log. NOTE: SR 26-2 is non-binding and places generative / agentic AI OUTSIDE its scope; this memo addresses quantitative / statistical models, not GenAI systems.
- **EU AI Act, Article 13 (Transparency & provision of information to deployers):** For systems in scope as high-risk, Article 13 expects accompanying information enabling deployers to interpret output and use it appropriately, including known limitations and accuracy. The calibration verdict, confidence direction, and discrimination (AUC) figure document how far the model's stated confidence can be taken at face value.
- **EU AI Act, Article 12 (Record-keeping) / SR 26-2 effective-challenge & audit trail:** These provisions expect automatic, tamper-resistant logging of events over the system's lifecycle. The attached evidence ledger is SHA-256 Merkle-chained: any edit, reorder, or deletion of a logged decision breaks the chain and is detectable by an independent party (recipe in Section 4).

## 4. Tamper-evidence & independent re-verification

The evidence ledger is SHA-256 Merkle-chained in the NEXUS ledger convention: each row stores `prev_hash = SHA-256(previous_row_text + "\n")`, anchored by a GENESIS sentinel. Any mutation of any logged decision changes that row's text and breaks every subsequent hash. An independent party can confirm the chain with NEXUS's production verifier:

```
python scripts/verify-jsonl-merkle-chain.py out/sample.evidence.jsonl --verbose
```

…or re-run this tool's self-check (recomputes the chain AND every metric, and confirms they match what is recorded):

```
python generate_attestation.py --verify out/sample.attestation.json
```

Expected Merkle tip: `50a5a296c397eaf45e92de86a2da85800e7dba751a10667a37496f6a35e265f4`
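
For readers who want to see what the chain check involves, here is a minimal sketch of the `prev_hash` rule described in this section. It is not the production verifier. It assumes each ledger line is a JSON object with a `prev_hash` field, that the first row's `prev_hash` is the literal string `GENESIS`, that row text is hashed as UTF-8, and that the tip is the hash of the final row. The real NEXUS encoding of the sentinel and the tip may differ, and the row fields in the toy ledger are hypothetical.

```python
import hashlib
import json

GENESIS = "GENESIS"  # assumed sentinel value; the real ledger may encode it differently

def row_hash(row_text: str) -> str:
    """SHA-256 over the row text plus a trailing newline."""
    return hashlib.sha256((row_text + "\n").encode("utf-8")).hexdigest()

def verify_chain(lines):
    """Return (True, tip) if every link holds, else (False, description of first break)."""
    expected_prev = GENESIS
    for index, text in enumerate(lines):
        stored = json.loads(text).get("prev_hash")
        if stored != expected_prev:
            return False, f"row {index}: prev_hash mismatch"
        expected_prev = row_hash(text)
    return True, expected_prev  # hash of the final row (assumed tip definition)

def append_row(lines, payload: dict) -> None:
    """Append a toy ledger row linked to the current end of `lines`."""
    prev = row_hash(lines[-1]) if lines else GENESIS
    lines.append(json.dumps({**payload, "prev_hash": prev}, sort_keys=True))

# Toy ledger with hypothetical fields, followed by a tampering attempt.
ledger = []
append_row(ledger, {"model": "model-aurora-v2", "p": 0.81, "outcome": 1})
append_row(ledger, {"model": "model-aurora-v2", "p": 0.64, "outcome": 0})
append_row(ledger, {"model": "model-aurora-v2", "p": 0.93, "outcome": 1})
ok_before, tip = verify_chain(ledger)

tampered = list(ledger)
tampered[0] = tampered[0].replace("0.81", "0.18")  # edit the first logged probability
ok_after, detail = verify_chain(tampered)
print(ok_before, ok_after, detail)
```

Under these assumptions, editing the final row breaks no stored link, since no later row records its hash. That is why a verifier also has to compare the recomputed tip with the published one.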

## 5. Disclaimer (calibrated scope)

This is a vendor-neutral attestation of the calibration of the *logged decisions supplied*. It certifies that, on those rows, the model's stated probabilities matched observed outcomes to the degree reported, and that the underlying log is tamper-evident. It is NOT a guarantee of model quality on unseen inputs, NOT a certification of regulatory compliance, and NOT legal advice. SR 26-2 (which supersedes SR 11-7 and SR 21-8, eff. 2026-04-17) is non-binding, is most relevant to banking organizations with >$30B in assets, and places generative / agentic AI outside its scope. Regulatory applicability (SR 26-2, EU AI Act, and any successor guidance) must be confirmed by qualified counsel against the deployer's specific use case. Calibration measured on past logged outcomes does not transfer to a different population or task without revalidation.

## Appendix — definitions

- **Brier score** = mean( (predicted_probability − observed_outcome)² ). 0 = perfect; 0.25 = a constant 0.5 forecast (a coin flip), whatever the base rate.
- **ECE** (Expected Calibration Error) = N-weighted mean absolute gap between mean predicted probability and observed rate across 10 equal-width bins. **MCE** (Maximum Calibration Error) = the largest single-bin gap.
- **AUC** measures discrimination (ranking), which is independent of calibration: a model can rank well yet be mis-calibrated, which is precisely the risk this report surfaces.
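
These definitions can be made concrete with a short sketch that computes each metric from raw (probability, outcome) pairs. The function names, the half-open binning with 1.0 folded into the top bin, and the half-credit handling of ties in AUC are assumptions for illustration; the attestation tool's exact conventions are not stated in this report. The toy data illustrates the last definition: perfect ranking alongside poor calibration.

```python
def brier(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def calibration_errors(probs, outcomes, n_bins=10):
    """Return (ECE, MCE) over equal-width bins; empty bins contribute nothing."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins[index].append((p, y))
    gaps, weights = [], []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        gaps.append(abs(mean_pred - observed))
        weights.append(len(members))
    ece = sum(w * g for w, g in zip(weights, gaps)) / sum(weights)
    return ece, max(gaps)

def auc(probs, outcomes):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy data: every positive outranks every negative, yet all scores are far too high.
probs = [0.82, 0.86, 0.91, 0.97]
outcomes = [0, 0, 1, 1]

ece, mce = calibration_errors(probs, outcomes)
print(f"Brier={brier(probs, outcomes):.3f} ECE={ece:.3f} MCE={mce:.3f} "
      f"AUC={auc(probs, outcomes):.3f}")
```

On this toy input the AUC is 1.0 while the Brier score is about 0.355, worse than the 0.25 of a constant 0.5 forecast.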

*Generated by generate_attestation.py (calibration-attestation/v1). Methodology is open and deterministic — the same input reproduces this attestation byte-for-byte.*