Calibration receipts · OOS 2023+ · Last refit 2026-04-25

Realised coverage, at every τ a consumer can ask for.

The empirical evidence behind the Oracle's calibration claim, computed on a held-out 2023+ slice (1,720 weekend × symbol observations, 172 weekends, 10 tickers). All numbers re-derivable from the public repo via scripts/run_calibration.py; methodology evolution at reports/methodology_history.md.

1 · the calibration claim

What the consumer asks for is what they empirically get.

For every consumer target τ from 0.50 to 0.99 in 0.01 increments, we query the deployed v2 / M5 Oracle on the OOS panel and compute realised coverage. A perfectly calibrated band lies on the diagonal. The four anchor points (0.68, 0.85, 0.95, 0.99) are where the deployed schedules C_BUMP_SCHEDULE and DELTA_SHIFT_SCHEDULE are tuned; every off-grid τ is served by linear interpolation between them.
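The serve-and-score loop above can be sketched minimally. This is an illustrative assumption, not the repo's actual code: the schedule values are made up, and `c_bump` / `realised_coverage` are hypothetical names, though C_BUMP_SCHEDULE itself is named in the text.

```python
import numpy as np

# Anchor taus where the deployed schedules are tuned (from the text);
# the schedule values below are invented for illustration only.
ANCHORS = np.array([0.68, 0.85, 0.95, 0.99])
C_BUMP_SCHEDULE = np.array([0.02, 0.05, 0.11, 0.21])

def c_bump(tau: float) -> float:
    """Off-grid taus are served by linear interpolation between anchors."""
    return float(np.interp(tau, ANCHORS, C_BUMP_SCHEDULE))

def realised_coverage(lo, hi, realised) -> float:
    """Fraction of observations whose realised value lands inside the band."""
    lo, hi, realised = map(np.asarray, (lo, hi, realised))
    return float(np.mean((realised >= lo) & (realised <= hi)))
```

Realised coverage at a given τ is then just `realised_coverage` evaluated over all served bands at that τ, plotted against the diagonal.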

D5 — Inter-anchor τ validation · 50 levels × 1,720 obs = 86,000 served bands · Kupiec passes at 47/50
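The Kupiec pass/fail calls above are the standard proportion-of-failures likelihood-ratio test. A minimal self-contained sketch (not the repo's implementation):

```python
import math

def kupiec_pvalue(n: int, n_fail: int, p_fail: float) -> float:
    """Kupiec (1995) POF test: p-value of the null that the true failure
    rate equals p_fail, via a chi-square(1) likelihood-ratio statistic."""
    phat = n_fail / n
    if n_fail == 0 or n_fail == n:
        phat = min(max(phat, 1e-12), 1 - 1e-12)  # avoid log(0) at the edges
    def loglik(q: float) -> float:
        return (n - n_fail) * math.log(1.0 - q) + n_fail * math.log(q)
    lr = -2.0 * (loglik(p_fail) - loglik(phat))
    # chi-square(1) survival function via the complementary error function:
    # P(X > lr) = erfc(sqrt(lr / 2)).
    return math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
```

At τ = 0.95 the expected failure rate is p_fail = 0.05, so a well-calibrated band over n = 1,720 observations should see about 86 misses; a p-value above α = 0.05 is a pass.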
2 · vs the incumbents

Pyth's CI isn't a probability statement.

Pyth's published conf is documented as publisher dispersion, not a probability of coverage. Taken at face value (k = 1.96, the "95% Gaussian wrap"), it covers only ~10% of weekend Monday opens. To match Soothsayer's 95% claim, a consumer needs to scale Pyth's conf by ≈50× — a calibration the consumer must construct themselves. Chainlink, during marketStatus = 5, publishes a band of zero width.
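"A calibration the consumer must construct themselves" means, operationally: sweep the scale factor k and read off realised coverage of the naive ±k·conf band. A hypothetical sketch of that sweep (`coverage_vs_k` is not a real Pyth or repo API):

```python
import numpy as np

def coverage_vs_k(mid, conf, realised, ks):
    """Realised coverage of naive ±k·conf bands around the published mid,
    one entry per scale factor k. A consumer re-calibrating Pyth themselves
    would pick the smallest k whose coverage reaches their target tau."""
    mid, conf, realised = map(np.asarray, (mid, conf, realised))
    return {float(k): float(np.mean(np.abs(realised - mid) <= k * conf))
            for k in ks}
```

The D-panel above is exactly this curve evaluated on the 265 OOS observations, with k = 1.96 marking the face-value reading.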

Pyth + naive ±k·conf — realised coverage as k grows · 265 OOS observations 2024+
Soothsayer at τ = 0.95 · 95.0% realised coverage · 354 bps half-width · Kupiec p = 0.956 · n = 1,720
Pyth + naive ±1.96·conf · 10.2% realised coverage · 11 bps half-width · the published "claim" mis-calibrates by 9× · n = 265
Chainlink stale-hold · no band · 100% of weekend obs show a zero-width band · §1.1 thesis confirmed empirically
3 · per-symbol receipts

Every ticker, every weekend, on the record.

Per-symbol calibration evidence on the full panel (5,986 weekend × symbol observations, 12 years). The leave-one-out validation (D7) refits the calibration surface on the other 9 symbols and serves the held-out one through the pooled-fallback path — the production code path for tickers with sparse history. The mechanism transfers.
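The D7 loop is simple in shape. In this sketch, `fit_surface` and `serve_pooled` are hypothetical stand-ins for the repo's calibration-fit and pooled-fallback entry points:

```python
def leave_one_out(panel, symbols, fit_surface, serve_pooled):
    """Refit the calibration surface on all symbols but one, then serve the
    held-out symbol through the pooled-fallback path — the production code
    path for tickers with sparse history."""
    results = {}
    for held_out in symbols:
        train = {s: panel[s] for s in symbols if s != held_out}
        surface = fit_surface(train)          # pooled refit on the other 9
        results[held_out] = serve_pooled(surface, panel[held_out])
    return results
```

Each held-out symbol is scored exactly as a sparse-history ticker would be in production, which is what makes the transfer claim meaningful.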

F1_emp_regime per-symbol · OOS τ = 0.95 (deployed) · sortable by any column
4 · robustness

The methodology is not load-bearing on hyperparameters.

Window size, train-test split timing, and held-out symbols all produce coverage within ±3pp of the deployed value. Window=156 is the only choice that simultaneously passes Kupiec at α=0.05 on all three targets — empirically defensible, not arbitrary.
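The sensitivity check can be phrased as a sweep. Here `coverage_at_window` is an assumed callable (window size and target τ in, realised coverage out), and the ±3pp tolerance comes from the text:

```python
def window_sweep(coverage_at_window, windows, targets, tol=0.03):
    """For each candidate window, check realised coverage stays within
    +/- tol (3pp) of every target tau; return coverages and the passing set."""
    results = {w: {t: coverage_at_window(w, t) for t in targets}
               for w in windows}
    passing = [w for w, cov in results.items()
               if all(abs(cov[t] - t) <= tol for t in targets)]
    return results, passing
```

The claim above is that this sweep, run with a Kupiec pass at α = 0.05 as the criterion, leaves window = 156 as the only survivor across all targets.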

Window sensitivity · Walk-forward stability (τ = 0.95) · Cross-asset transfer (LOO)
D8 — Window-size sensitivity · realised coverage vs window · diagonal = target
5 · the receipts

Audit trail, methodology, and full reproducibility.

Every claim on this dashboard re-derives from the public repository. The methodology evolution log records every decision, hypothesis, and rejected alternative. The paper drafts under reports/paper1_coverage_inversion/ contain the formal version of these claims, currently being prepared for arXiv (q-fin.RM) and ACM AFT.