Calibration Status

All scoring parameters today carry calibrated: false. This page explains what that means, why it matters, and what the calibration roadmap looks like.


Current state

Every API response containing a substantive block carries a calibrated: false flag and a null calibration_method. This is not a disclaimer covering missing functionality. It is an accurate statement of the epistemic state of the parameters. The evidence weights, half-lives, and prior parameters are theory-informed defaults derived from the predictive-validity literature (Sackett 2022, Arthur 1998, Wang 2013). They are not yet fitted to Bukti's own outcome data.


What "calibrated" means in this system

A calibrated scoring system is one where predicted probabilities match empirical frequencies. If the system predicts a substantive median of 0.70 for a capability claim, then about 70% of entities given that score should perform the capability competently when tested.
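As an illustration, that definition can be checked directly by grouping historical (predicted, observed) pairs and comparing each group's empirical success rate to its prediction. This is a generic sketch on invented data, not Bukti's implementation:

```python
from collections import defaultdict

def empirical_calibration(pairs, round_to=2):
    """Group (predicted probability, binary outcome) pairs by predicted
    value and return each group's empirical success rate."""
    buckets = defaultdict(list)
    for predicted, observed in pairs:
        buckets[round(predicted, round_to)].append(observed)
    return {p: sum(obs) / len(obs) for p, obs in sorted(buckets.items())}

# Invented outcomes: a well-calibrated 0.70 prediction should
# succeed about 70% of the time.
pairs = [(0.70, 1)] * 7 + [(0.70, 0)] * 3
print(empirical_calibration(pairs))  # {0.7: 0.7}
```

In practice each group needs enough samples for the empirical rate to be meaningful, which is why the roadmap below talks about per-decile sample sizes.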

Calibration is measured using:

  • Reliability diagrams: predicted probability buckets on the x-axis, empirical success rate on the y-axis. A perfectly calibrated system produces points on the diagonal.
  • Brier score (Murphy 1973 decomposition): BS = reliability − resolution + uncertainty. The aim is to minimize the reliability component (systematic over/underprediction), not the uncertainty component (which reflects genuine difficulty of prediction).
  • Platt scaling or isotonic regression: recalibration techniques applied per evidence type, per cluster, when the reliability diagram reveals systematic bias.
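The Murphy (1973) decomposition in the second bullet can be computed directly when forecasts take discrete values, by binning on the exact forecast value so the identity BS = reliability − resolution + uncertainty holds exactly. A minimal sketch, again illustrative rather than Bukti's code:

```python
from collections import defaultdict

def brier_decomposition(pairs):
    """Murphy (1973): BS = reliability - resolution + uncertainty.

    pairs: (forecast probability, binary outcome). Bins by exact
    forecast value, so the decomposition identity is exact."""
    n = len(pairs)
    base_rate = sum(o for _, o in pairs) / n
    bins = defaultdict(list)
    for f, o in pairs:
        bins[f].append(o)
    # reliability: weighted squared gap between forecast and observed rate
    reliability = sum(
        len(os) * (f - sum(os) / len(os)) ** 2 for f, os in bins.items()) / n
    # resolution: how far each bin's observed rate sits from the base rate
    resolution = sum(
        len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

On data where every 0.7 forecast succeeds 70% of the time, the reliability term is zero and the Brier score reduces to uncertainty minus resolution, which is exactly the property the roadmap aims to achieve for Bukti's parameters.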

Calibration roadmap

When the first pilot cohort has produced a meaningful set of (entity, capability, predicted, observed) tuples, the system will:

  1. Compute per-evidence-type, per-cluster Brier scores.
  2. Produce reliability diagrams for each predicted-probability decile.
  3. Fit recalibration corrections (Platt scaling first; isotonic regression if the reliability curve is non-monotonic).
  4. Publish corrected weight values, reliability diagrams, and Brier scores.
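Step 3's Platt correction amounts to fitting a sigmoid, p = σ(a·s + b), that maps raw scores s to recalibrated probabilities. The sketch below is a minimal stand-in using plain gradient descent on the log loss; the function names and data are illustrative, not Bukti's implementation (a production fit would use a library optimizer):

```python
import math

def platt_fit(scores, outcomes, lr=0.5, steps=5000):
    """Fit sigmoid(a*s + b) to binary outcomes by gradient descent
    on the log loss -- a minimal stand-in for Platt scaling."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, outcomes):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n   # gradient of log loss w.r.t. a
            gb += (p - y) / n       # gradient of log loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return a, b

def platt_apply(a, b, s):
    """Map a raw score through the fitted correction."""
    return 1.0 / (1.0 + math.exp(-(a * s + b)))
```

On invented data where raw 0.8 scores succeed only 50% of the time and raw 0.3 scores only 10%, the fitted curve pulls both down toward their observed rates, which is the systematic-overprediction case the reliability component of the Brier score detects.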

The target: reliability error within 0.05 in each decile, given a sufficient sample size in that decile.
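That target could be checked with a helper along these lines; the minimum-sample threshold `min_n` is an assumption for the example, not a published Bukti parameter:

```python
def decile_reliability(pairs, tolerance=0.05, min_n=30):
    """Bin (predicted probability, binary outcome) pairs into
    predicted-probability deciles and flag whether each populated
    decile's reliability error is within tolerance.

    Returns (decile index, n, error, within_tolerance) tuples."""
    deciles = [[] for _ in range(10)]
    for p, o in pairs:
        deciles[min(int(p * 10), 9)].append((p, o))
    report = []
    for i, bucket in enumerate(deciles):
        if len(bucket) < min_n:
            continue  # too few samples to judge this decile
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        obs_rate = sum(o for _, o in bucket) / len(bucket)
        err = abs(mean_pred - obs_rate)
        report.append((i, len(bucket), err, err <= tolerance))
    return report
```

A decile that predicts 0.75 and observes a 0.74 success rate passes (error 0.01); one that predicts 0.25 and observes 0.40 fails (error 0.15) and would trigger the recalibration step above.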

A subsequent step adds hierarchical pooling across capability clusters and per-attestor reliability calibration. Threshold values for the tier matrix will be tuned to match labeled outcome rates.


Why publish uncalibrated values at all

The alternative — waiting for calibrated values before publishing any parameters — would mean operating an opaque system. Bukti's position, modeled on Twitter Community Notes and FICO's regulatory practice, is that being transparent and honest about limitations is preferable to being polished but opaque. The calibrated: false flag and this page are the disclosure mechanism.