Evidence Type Weights

This page documents the categories of evidence and the direction of their weights in Bukti's substantive scoring. The specific numeric weights are tunable parameters held in private configuration; they will be publicly disclosed once calibration data from the first pilot cohort is available.


What these weights represent

In the Beta-Binomial formula, w_{t,c} is the operational validity of evidence type t in capability cluster c. It answers: "how many effective task-observations does one VOI of this type substitute for?" Higher weight means the evidence is a stronger signal of genuine capability. These weights are grounded in the predictive-validity literature for employment outcomes, primarily Sackett, Zhang, Berry & Lievens (2022) in Personnel Psychology (corrected operational validity estimates) and the 2024 follow-up by Lievens et al.
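As a minimal sketch of this interpretation (the update rule and all numeric values below are illustrative assumptions, not Bukti's private weights), w_{t,c} can be read as a pseudo-count multiplier in a Beta posterior:

```python
# Sketch: weighted Beta-Binomial update. All numbers are illustrative
# placeholders, NOT Bukti's private calibrated weights.

def weighted_beta_update(alpha, beta, successes, failures, weight):
    """Treat evidence of type t in cluster c as `weight` effective
    task-observations (pseudo-counts) added to the Beta posterior."""
    return alpha + weight * successes, beta + weight * failures

# One verified task outcome (hypothetical weight 1.0) vs. one
# self-report (hypothetical weight 0.1), from a flat Beta(1, 1) prior:
a, b = weighted_beta_update(1.0, 1.0, successes=1, failures=0, weight=1.0)
print(a / (a + b))   # posterior mean moves substantially

a, b = weighted_beta_update(1.0, 1.0, successes=1, failures=0, weight=0.1)
print(a / (a + b))   # posterior mean barely moves
```

The same positive observation shifts belief far more when it arrives as a high-weight evidence type, which is exactly the "effective task-observations" reading above.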

The weights are theory-informed defaults today, not calibrated to Bukti's specific outcome data. When pilot data yields (predicted, observed) outcome pairs, the weights will be recalibrated and disclosed.


Evidence type categories — relative ranking

Categories are listed strongest-first. Within each category, weight scales with provenance strength (signed vs. unsigned, issuer rigor, attestor identity grade).

  1. task_outcome — Verified, measurable outcomes from real deployments. Strongest single category. Maps to the work-sample literature (Sackett 2022 Table 1).
  2. behavioral_artifact — Direct demonstration of capability through code, prototypes, or other behavioral output. Strength scales with cryptographic signing (Sigstore / GPG-signed > unsigned).
  3. credential_badge — Issuer-attested credentials. Weight contingent on issuer rigor and Open Badges 3.0 signature verification.
  4. publication_artifact — Peer-reviewed publication. Domain-dependent; particularly strong for research and pedagogy clusters.
  5. peer_attestation — Structured 360-feedback peer ratings outweigh unstructured peer endorsements (Sackett 2022 / Lievens 2024). Weight scales with attestor identity grade.
  6. contribution_artifact — Behavioral artifact with weaker provenance than signed commits.
  7. indirect_attestation — Mentions of an entity by third parties. Deliberately low weight: mentions are weak signal and easily fabricated.
  8. self_reported / self_authored — Capped low so that no volume of self-reports can reach the Attested tier alone.
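The public ordering above can be captured as a simple rank structure. This is a sketch only: the list encodes relative position, and no numeric weights appear because those remain private config. The helper name is hypothetical.

```python
# Public category ordering, strongest first. Ranks are positional only;
# the actual numeric weights are held in private config.
EVIDENCE_CATEGORIES = [
    "task_outcome",           # verified, measurable deployment outcomes
    "behavioral_artifact",    # signed > unsigned behavioral output
    "credential_badge",       # contingent on issuer rigor + OB 3.0 signature
    "publication_artifact",   # domain-dependent (research, pedagogy)
    "peer_attestation",       # structured 360 > unstructured endorsement
    "contribution_artifact",  # weaker provenance than signed commits
    "indirect_attestation",   # deliberately low: easily fabricated
    "self_reported",          # capped; cannot alone reach the Attested tier
]

def stronger_than(a: str, b: str) -> bool:
    """True if category `a` outranks category `b` (lower index = stronger)."""
    return EVIDENCE_CATEGORIES.index(a) < EVIDENCE_CATEGORIES.index(b)
```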

Cluster-specific overrides

Some clusters carry higher default weights for specific evidence types. The literature finds that validity coefficients have high standard deviations across contexts (Sackett 2022 SD column): for software engineering, verified task outcomes (deployed, measurable systems) are the strongest possible signal, justifying a higher ceiling. For educational research, peer-reviewed publication is a stronger-than-default signal.

The category structure is public; the specific cluster-level numeric overrides are tunable.
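A hedged sketch of how such overrides might resolve, assuming a lookup that prefers a (cluster, type) entry over the type default. The table shape mirrors the public structure; every number below is a hypothetical placeholder, not a disclosed weight.

```python
# Sketch: cluster-specific override lookup. Structure is public;
# all numeric values here are hypothetical placeholders.
DEFAULT_WEIGHTS = {
    "task_outcome": 1.0,          # placeholder
    "publication_artifact": 0.4,  # placeholder
}

CLUSTER_OVERRIDES = {
    # Verified deployed systems: strongest signal for software engineering.
    ("software_engineering", "task_outcome"): 1.2,
    # Peer-reviewed publication: stronger than default for this cluster.
    ("educational_research", "publication_artifact"): 0.8,
}

def weight_for(evidence_type: str, cluster: str) -> float:
    """Cluster-level override wins; otherwise fall back to the type default."""
    return CLUSTER_OVERRIDES.get(
        (cluster, evidence_type), DEFAULT_WEIGHTS[evidence_type]
    )
```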


What "calibrated" means here

The system reports calibrated: false until weights are validated. Calibration requires:

  1. A set of (entity, capability, predicted, observed) tuples from a pilot cohort.
  2. Brier-score computation per evidence type, per cluster.
  3. Isotonic regression or Platt scaling to correct systematic miscalibration.
  4. A reliability diagram showing predicted probability vs. empirical frequency in each decile.
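The four steps above can be sketched on synthetic data. This is an illustration of the procedure, not the production pipeline: the (predicted, observed) pairs are made up, and the isotonic-regression step uses scikit-learn's implementation as a stand-in for whichever recalibration method is chosen.

```python
# Sketch of calibration checks on pilot (predicted, observed) pairs.
# The data below is synthetic, standing in for real pilot-cohort tuples.
import numpy as np
from sklearn.isotonic import IsotonicRegression

predicted = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
observed  = np.array([1,   1,   0,   1,   0,   0])

# Step 2: Brier score = mean squared error of the probabilistic forecasts.
brier = np.mean((predicted - observed) ** 2)

# Step 3: isotonic regression learns a monotone correction map
# from predicted probability to empirical frequency.
iso = IsotonicRegression(out_of_bounds="clip")
recalibrated = iso.fit_transform(predicted, observed)

# Step 4: reliability-diagram data, binned by decile of predicted prob.
bins = np.clip((predicted * 10).astype(int), 0, 9)
for d in np.unique(bins):
    mask = bins == d
    print(d, predicted[mask].mean(), observed[mask].mean())
```

A well-calibrated evidence type would show predicted means tracking observed frequencies across deciles; systematic gaps are what steps 3 and 4 are designed to expose and correct.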

See calibration-status.md.