Calibration receipts · OOS 2023+ · Last refit 2026-04-25

Realised coverage, at every τ a consumer can ask for.

The empirical evidence behind the Oracle's calibration claim, computed on a held-out 2023+ slice (1,720 weekend × symbol observations, 172 weekends, 10 tickers). All numbers re-derivable from the public repo via scripts/run_calibration.py; methodology evolution at reports/methodology_history.md.

1 · the calibration claim

What the consumer asks for is what they empirically get.

For every consumer target τ from 0.50 to 0.99 in 0.01 increments, we query the deployed v2 / M5 Oracle on the OOS panel and compute realised coverage. A perfectly calibrated band lies on the diagonal. The four anchor points (0.68, 0.85, 0.95, 0.99) are where the deployed schedules C_BUMP_SCHEDULE and DELTA_SHIFT_SCHEDULE are tuned; every off-grid τ is served by linear interpolation between them.
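The serve-and-score loop above can be sketched minimally. This is an illustrative assumption, not the repo's actual code: the schedule values are made up, and `c_bump` / `realised_coverage` are hypothetical names, though C_BUMP_SCHEDULE itself is named in the text.

```python
import numpy as np

# Anchor taus where the deployed schedules are tuned (from the text);
# the schedule values below are invented for illustration only.
ANCHORS = np.array([0.68, 0.85, 0.95, 0.99])
C_BUMP_SCHEDULE = np.array([0.02, 0.05, 0.11, 0.21])

def c_bump(tau: float) -> float:
    """Off-grid taus are served by linear interpolation between anchors."""
    return float(np.interp(tau, ANCHORS, C_BUMP_SCHEDULE))

def realised_coverage(lo, hi, realised) -> float:
    """Fraction of observations whose realised value lands inside the band."""
    lo, hi, realised = map(np.asarray, (lo, hi, realised))
    return float(np.mean((realised >= lo) & (realised <= hi)))
```

Realised coverage at a given τ is then just `realised_coverage` evaluated over all served bands at that τ, plotted against the diagonal.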

D5 — Inter-anchor τ validation · 50 levels × 1,720 obs = 86,000 served bands · Kupiec passes at 47/50
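The Kupiec pass/fail calls above are the standard proportion-of-failures likelihood-ratio test. A minimal self-contained sketch (not the repo's implementation):

```python
import math

def kupiec_pvalue(n: int, n_fail: int, p_fail: float) -> float:
    """Kupiec (1995) POF test: p-value of the null that the true failure
    rate equals p_fail, via a chi-square(1) likelihood-ratio statistic."""
    phat = n_fail / n
    if n_fail == 0 or n_fail == n:
        phat = min(max(phat, 1e-12), 1 - 1e-12)  # avoid log(0) at the edges
    def loglik(q: float) -> float:
        return (n - n_fail) * math.log(1.0 - q) + n_fail * math.log(q)
    lr = -2.0 * (loglik(p_fail) - loglik(phat))
    # chi-square(1) survival function via the complementary error function:
    # P(X > lr) = erfc(sqrt(lr / 2)).
    return math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
```

At τ = 0.95 the expected failure rate is p_fail = 0.05, so a well-calibrated band over n = 1,720 observations should see about 86 misses; a p-value above α = 0.05 is a pass.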
2 · vs the incumbents

Pyth's CI isn't a probability statement.

Pyth's published conf is documented as publisher dispersion, not a probability of coverage. Taken at face value (k = 1.96, the "95% Gaussian wrap"), it covers only ~10% of weekend Monday opens. To match Soothsayer's 95% claim, a consumer needs to scale Pyth's conf by ≈50× — a calibration the consumer must construct themselves. Chainlink, during marketStatus = 5, publishes a band of zero width.
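"A calibration the consumer must construct themselves" means, operationally: sweep the scale factor k and read off realised coverage of the naive ±k·conf band. A hypothetical sketch of that sweep (`coverage_vs_k` is not a real Pyth or repo API):

```python
import numpy as np

def coverage_vs_k(mid, conf, realised, ks):
    """Realised coverage of naive ±k·conf bands around the published mid,
    one entry per scale factor k. A consumer re-calibrating Pyth themselves
    would pick the smallest k whose coverage reaches their target tau."""
    mid, conf, realised = map(np.asarray, (mid, conf, realised))
    return {float(k): float(np.mean(np.abs(realised - mid) <= k * conf))
            for k in ks}
```

The D-panel above is exactly this curve evaluated on the 265 OOS observations, with k = 1.96 marking the face-value reading.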

Pyth + naive ±k·conf — realised coverage as k grows · 265 OOS observations 2024+
Soothsayer at τ = 0.95 · 95.0% realised coverage · 354 bps half-width · Kupiec p = 0.956 · n = 1,720
Pyth + naive ±1.96·conf · 10.2% realised coverage · 11 bps half-width · the published "claim" mis-calibrates by 9× · n = 265
Chainlink stale-hold · no band · 100% of weekend obs show a zero-width band · §1.1 thesis confirmed empirically
3 · per-symbol receipts

Every ticker, every weekend, on the record.

Per-symbol calibration evidence on the full panel (5,986 weekend × symbol observations, 12 years). The leave-one-out validation (D7) refits the calibration surface on the other 9 symbols and serves the held-out one through the pooled-fallback path — the production code path for tickers with sparse history. The mechanism transfers.
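The D7 loop is simple in shape. In this sketch, `fit_surface` and `serve_pooled` are hypothetical stand-ins for the repo's calibration-fit and pooled-fallback entry points:

```python
def leave_one_out(panel, symbols, fit_surface, serve_pooled):
    """Refit the calibration surface on all symbols but one, then serve the
    held-out symbol through the pooled-fallback path — the production code
    path for tickers with sparse history."""
    results = {}
    for held_out in symbols:
        train = {s: panel[s] for s in symbols if s != held_out}
        surface = fit_surface(train)          # pooled refit on the other 9
        results[held_out] = serve_pooled(surface, panel[held_out])
    return results
```

Each held-out symbol is scored exactly as a sparse-history ticker would be in production, which is what makes the transfer claim meaningful.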

F1_emp_regime per-symbol · OOS τ = 0.95 (deployed) · sortable by any column
4 · robustness

The methodology is not load-bearing on hyperparameters.

Window size, train-test split timing, and held-out symbols all produce coverage within ±3pp of the deployed value. Window=156 is the only choice that simultaneously passes Kupiec at α=0.05 on all three targets — empirically defensible, not arbitrary.
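The sensitivity check can be phrased as a sweep. Here `coverage_at_window` is an assumed callable (window size and target τ in, realised coverage out), and the ±3pp tolerance comes from the text:

```python
def window_sweep(coverage_at_window, windows, targets, tol=0.03):
    """For each candidate window, check realised coverage stays within
    +/- tol (3pp) of every target tau; return coverages and the passing set."""
    results = {w: {t: coverage_at_window(w, t) for t in targets}
               for w in windows}
    passing = [w for w, cov in results.items()
               if all(abs(cov[t] - t) <= tol for t in targets)]
    return results, passing
```

The claim above is that this sweep, run with a Kupiec pass at α = 0.05 as the criterion, leaves window = 156 as the only survivor across all targets.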

Window sensitivity · Walk-forward stability (τ = 0.95) · Cross-asset transfer (LOO)
D8 — Window-size sensitivity · realised coverage vs window · diagonal = target
5 · the receipts

Audit trail, methodology, and full reproducibility.

Every claim on this dashboard re-derives from the public repository. The methodology evolution log records every decision, hypothesis, and rejected alternative. The paper drafts under reports/paper1_coverage_inversion/ contain the formal version of these claims, currently being prepared for arXiv (q-fin.RM) and ACM AFT.