Calibration Status

All scoring parameters today carry calibrated: false. This page explains what that means, why it matters, and what the calibration roadmap looks like.


Current state

Every API response containing a substantive block carries a calibrated: false flag and a null calibration_method. This is not a disclaimer covering missing functionality. It is an accurate statement of the epistemic state of the parameters. The evidence weights, half-lives, and prior parameters are theory-informed defaults derived from the predictive-validity literature (Sackett 2022, Arthur 1998, Wang 2013). They are not yet fitted to Bukti's own outcome data.


What "calibrated" means in this system

A calibrated scoring system is one where predicted probabilities match empirical frequencies. If the system predicts a substantive median of 0.70 for a capability claim, then about 70% of entities given that score should perform the capability competently when tested.
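As an illustration, that definition can be checked directly by grouping historical (predicted, observed) pairs and comparing each group's empirical success rate to its prediction. This is a generic sketch on invented data, not Bukti's implementation:

```python
from collections import defaultdict

def empirical_calibration(pairs, round_to=2):
    """Group (predicted probability, binary outcome) pairs by predicted
    value and return each group's empirical success rate."""
    buckets = defaultdict(list)
    for predicted, observed in pairs:
        buckets[round(predicted, round_to)].append(observed)
    return {p: sum(obs) / len(obs) for p, obs in sorted(buckets.items())}

# Invented outcomes: a well-calibrated 0.70 prediction should
# succeed about 70% of the time.
pairs = [(0.70, 1)] * 7 + [(0.70, 0)] * 3
print(empirical_calibration(pairs))  # {0.7: 0.7}
```

In practice each group needs enough samples for the empirical rate to be meaningful, which is why the roadmap below talks about per-decile sample sizes.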

Calibration is measured using:

  • Reliability diagrams: predicted probability buckets on the x-axis, empirical success rate on the y-axis. A perfectly calibrated system produces points on the diagonal.
  • Brier score (Murphy 1973 decomposition): BS = reliability − resolution + uncertainty. The aim is to minimize the reliability component (systematic over/underprediction), not the uncertainty component (which reflects genuine difficulty of prediction).
  • Platt scaling or isotonic regression: recalibration techniques applied per evidence type, per cluster, when the reliability diagram reveals systematic bias.
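The Murphy (1973) decomposition in the second bullet can be computed directly when forecasts take discrete values, by binning on the exact forecast value so the identity BS = reliability − resolution + uncertainty holds exactly. A minimal sketch, again illustrative rather than Bukti's code:

```python
from collections import defaultdict

def brier_decomposition(pairs):
    """Murphy (1973): BS = reliability - resolution + uncertainty.

    pairs: (forecast probability, binary outcome). Bins by exact
    forecast value, so the decomposition identity is exact."""
    n = len(pairs)
    base_rate = sum(o for _, o in pairs) / n
    bins = defaultdict(list)
    for f, o in pairs:
        bins[f].append(o)
    # reliability: weighted squared gap between forecast and observed rate
    reliability = sum(
        len(os) * (f - sum(os) / len(os)) ** 2 for f, os in bins.items()) / n
    # resolution: how far each bin's observed rate sits from the base rate
    resolution = sum(
        len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

On data where every 0.7 forecast succeeds 70% of the time, the reliability term is zero and the Brier score reduces to uncertainty minus resolution, which is exactly the property the roadmap aims to achieve for Bukti's parameters.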

Calibration roadmap

When the first pilot cohort has produced a meaningful set of (entity, capability, predicted, observed) tuples, the system will:

  1. Compute per-evidence-type, per-cluster Brier scores.
  2. Produce reliability diagrams for each predicted-probability decile.
  3. Fit recalibration corrections (Platt scaling first; isotonic regression if the reliability curve is non-monotonic).
  4. Publish corrected weight values, reliability diagrams, and Brier scores.
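Step 3's Platt correction amounts to fitting a sigmoid, p = σ(a·s + b), that maps raw scores s to recalibrated probabilities. The sketch below is a minimal stand-in using plain gradient descent on the log loss; the function names and data are illustrative, not Bukti's implementation (a production fit would use a library optimizer):

```python
import math

def platt_fit(scores, outcomes, lr=0.5, steps=5000):
    """Fit sigmoid(a*s + b) to binary outcomes by gradient descent
    on the log loss -- a minimal stand-in for Platt scaling."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, outcomes):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n   # gradient of log loss w.r.t. a
            gb += (p - y) / n       # gradient of log loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return a, b

def platt_apply(a, b, s):
    """Map a raw score through the fitted correction."""
    return 1.0 / (1.0 + math.exp(-(a * s + b)))
```

On invented data where raw 0.8 scores succeed only 50% of the time and raw 0.3 scores only 10%, the fitted curve pulls both down toward their observed rates, which is the systematic-overprediction case the reliability component of the Brier score detects.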

The target: reliability error within 0.05 in each decile, given a sufficient sample size in that decile.
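That target could be checked with a helper along these lines; the minimum-sample threshold `min_n` is an assumption for the example, not a published Bukti parameter:

```python
def decile_reliability(pairs, tolerance=0.05, min_n=30):
    """Bin (predicted probability, binary outcome) pairs into
    predicted-probability deciles and flag whether each populated
    decile's reliability error is within tolerance.

    Returns (decile index, n, error, within_tolerance) tuples."""
    deciles = [[] for _ in range(10)]
    for p, o in pairs:
        deciles[min(int(p * 10), 9)].append((p, o))
    report = []
    for i, bucket in enumerate(deciles):
        if len(bucket) < min_n:
            continue  # too few samples to judge this decile
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        obs_rate = sum(o for _, o in bucket) / len(bucket)
        err = abs(mean_pred - obs_rate)
        report.append((i, len(bucket), err, err <= tolerance))
    return report
```

A decile that predicts 0.75 and observes a 0.74 success rate passes (error 0.01); one that predicts 0.25 and observes 0.40 fails (error 0.15) and would trigger the recalibration step above.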

A subsequent step adds hierarchical pooling across capability clusters and per-attestor reliability calibration. Threshold values for the tier matrix will be tuned to match labeled outcome rates.


Why publish uncalibrated values at all

The alternative — waiting for calibrated values before publishing any parameters — would mean operating an opaque system. Bukti's position, modeled on Twitter Community Notes and FICO's regulatory practice, is that being transparent and honest about limitations is preferable to being polished but opaque. The calibrated: false flag and this page are the disclosure mechanism.