Trust Layer
New in v0.3.0, the Trust Layer is what turns Provenir from an orchestration framework into the trust layer for model post-training. It adds the primitives that raw-throughput engines (verl, TRL, Unsloth) leave out: RL observability, reward-hacking detection, contamination-safe evaluation, verifiable rewards, deterministic replay, and a signed Model Passport.
This guide covers:
- The import provenir wrapper: drop the trust layer into any run
- The RL Flight Recorder
- The reward-hacking detector
- Verifiable-reward environments (RLVR)
- The contamination firewall + canary vaults
- The Model Passport (sign, verify, risk flags)
All examples are copy-paste accurate against the v0.3.0 API.
1. The import provenir wrapper
The fastest way to adopt the trust layer is the three-line wrapper. Wrap any
existing training loop (verl, TRL, Unsloth, or your own) in a
provenir.track(...) context. Provenir attaches provenance, an RL Flight
Recorder, a reward-hacking report, and (optionally) a signed Model Passport
without any changes to your trainer.
import provenir

with provenir.track("my-run", dataset=train_ds) as run:
    for step, metrics in enumerate(training_loop()):
        run.log_step(metrics)  # feeds the flight recorder
    run.record_eval("mmlu", score=0.71)

# Everything the trust layer produced is available on `run` after exit:
manifest = run.manifest                # content-addressed RunManifest
flight_recorder = run.flight_recorder  # FlightRecorder with per-step anomalies
hacking_report = run.hacking_report    # HackingReport over buffered trajectories
lineage = run.lineage                  # LineageGraph (dataset -> run -> eval)
bom = run.bom                          # ModelBOM (bill of materials)
passport = run.passport                # signed ModelPassport, or None if unsigned
anomalies = run.anomalies              # list[Anomaly] the flight recorder caught
To also emit a signed Model Passport, pass a signing key:
with provenir.track(
    "my-run",
    dataset=train_ds,
    base_model="Qwen2.5-7B",
    sign_passport=True,
    signing_key=b"team-signing-key",
    output_dir="artifacts/my-run",
) as run:
    for step, metrics in enumerate(training_loop()):
        run.log_step(metrics)
        run.log_trajectory({"prediction": last_response, "reward": reward})
    run.record_eval("gsm8k", score=0.71)

assert run.passport is not None
assert run.passport.verify(b"team-signing-key")
metrics can be any dict of training/RL signals — recognised keys are step,
kl, entropy, reward_mean, reward_std, response_length_mean,
advantage_std, grad_norm, learning_rate (unknown keys are ignored). On
exit, every artifact is written under output_dir
(manifests/<run_id>.json, lineage.json, flight_recorder.json,
hacking_report.json, bom.json, and passport.md / passport.json when
signed).
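To make the "buffer during the run, write on exit" behaviour concrete, here is a minimal sketch of how a tracking context manager of this kind could work. It is not Provenir's implementation: SketchRun, track_sketch, and the run_sketch.json file name are hypothetical names invented for illustration, and only the Python standard library is used.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path


class SketchRun:
    """Hypothetical stand-in for the object a tracking wrapper yields."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # buffered per-step metric dicts
        self.evals = {}  # benchmark name -> score

    def log_step(self, metrics):
        self.steps.append(dict(metrics))

    def record_eval(self, benchmark, score):
        self.evals[benchmark] = score


@contextmanager
def track_sketch(name, output_dir):
    """Yield a run object, then write what it buffered when the block exits."""
    run = SketchRun(name)
    try:
        yield run
    finally:
        # The finally clause flushes artifacts even if the training loop raised.
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        payload = {"name": run.name, "steps": run.steps, "evals": run.evals}
        (out / "run_sketch.json").write_text(json.dumps(payload, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    with track_sketch("demo", tmp) as run:
        for step in range(3):
            run.log_step({"step": step, "kl": 0.01 * step})
        run.record_eval("mmlu", score=0.71)
    # Read the artifact back while the temporary directory still exists.
    written = json.loads((Path(tmp) / "run_sketch.json").read_text())
```

Writing in a finally clause is one plausible design choice: a crashed run still leaves its partial metrics on disk for inspection.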
2. The RL Flight Recorder
The Flight Recorder (provenir.observability) is a black box for RL runs.
It watches every step and flags the pathologies that quietly wreck RL training
long before the final reward curve looks wrong:
- kl_blowup / kl_collapse
- entropy_collapse
- length_explosion
- advantage_collapse (GRPO group degeneracy)
- reward_std_collapse
- reward_spike
- grad_explosion
Logging step metrics and reading anomalies
from provenir.observability import FlightRecorder, RLStepMetrics

recorder = FlightRecorder()

for step, batch in enumerate(rl_loop()):
    anomalies = recorder.log_step(RLStepMetrics(
        step=step,
        kl=batch.kl,
        entropy=batch.entropy,
        reward_mean=batch.reward_mean,
        reward_std=batch.reward_std,
        advantage_std=batch.advantage_std,
        response_length_mean=batch.mean_response_length,
        grad_norm=batch.grad_norm,
    ))
    for a in anomalies:  # anomalies detected on THIS step
        print(f"[step {a.step}] {a.severity} {a.kind}: {a.detail}")

# `anomalies` is also a property holding everything caught so far:
for a in recorder.anomalies:
    print(a.kind, a.detail)

print(recorder.health_report())  # HEALTHY / DEGRADED / CRITICAL summary
summary = recorder.summary()     # counts by kind/severity + final metrics
log_step accepts an RLStepMetrics; the import provenir wrapper also lets
you pass a plain dict. Because the recorder only needs per-step scalars, it
works whether the gradient step runs on verl, TRL, Unsloth, or a custom loop.
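Because the recorder only needs per-step scalars, the core idea can be sketched as a set of threshold checks. The following is an illustrative sketch, not the FlightRecorder's actual logic: StepAnomaly, check_step, and every threshold value are assumptions chosen for the example, and only the anomaly kind names come from the list above.

```python
from dataclasses import dataclass


@dataclass
class StepAnomaly:
    step: int
    kind: str
    detail: str


def check_step(step, metrics, kl_max=0.5, entropy_min=0.05, grad_norm_max=100.0):
    """Return the anomalies for one step. Thresholds are illustrative assumptions."""
    found = []
    kl = metrics.get("kl")
    if kl is not None and kl > kl_max:
        found.append(StepAnomaly(step, "kl_blowup", f"kl={kl:.3f} > {kl_max}"))
    entropy = metrics.get("entropy")
    if entropy is not None and entropy < entropy_min:
        found.append(
            StepAnomaly(step, "entropy_collapse", f"entropy={entropy:.3f} < {entropy_min}")
        )
    grad_norm = metrics.get("grad_norm")
    if grad_norm is not None and grad_norm > grad_norm_max:
        found.append(
            StepAnomaly(step, "grad_explosion", f"grad_norm={grad_norm:.1f} > {grad_norm_max}")
        )
    return found


# Three synthetic steps: healthy, KL spike, then entropy collapse plus a gradient spike.
history = [
    {"kl": 0.02, "entropy": 1.2, "grad_norm": 1.1},
    {"kl": 0.90, "entropy": 0.9, "grad_norm": 1.3},
    {"kl": 0.03, "entropy": 0.01, "grad_norm": 250.0},
]
all_anomalies = [a for step, m in enumerate(history) for a in check_step(step, m)]
kinds = [a.kind for a in all_anomalies]
```

Fixed thresholds keep the sketch short; a production detector would more plausibly compare each value against a rolling baseline of recent steps.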
3. The reward-hacking detector
Reward hacking is the #1 RL bottleneck: models learn to maximise the reward
signal without doing the task — inflating response length, exploiting output
format, or literally editing the tests to make them pass. The detector
(provenir.observability) looks for length_inflation, format_exploit,
test_tampering (unittest.skip, sys.exit(0), monkeypatching),
verifier_gaming, proxy_divergence (reward up, real eval flat/down),
degenerate_repetition, and advantage_collapse.
from provenir.observability import RewardHackingDetector

detector = RewardHackingDetector()

# One trajectory:
signals = detector.detect({
    "prediction": "import sys; sys.exit(0)  # skip failing tests",
    "proxy_reward": 1.0,
    "true_reward": 0.0,
})
print([s.kind for s in signals])  # ['test_tampering', 'proxy_divergence']

# A batch of rollouts -> a HackingReport:
report = detector.detect_batch(rollouts)  # rollouts: list[dict] with prediction/reward
if not report.is_clean:
    for signal in report.signals:
        print(f"  - {signal.kind}: {signal.detail}")
    print(f"hacking rate: {report.hacking_rate:.1%}")
    print(report.by_kind())  # {'test_tampering': 3, 'proxy_divergence': 1}

# Group degeneracy (all rewards equal -> zero advantage):
group_signals = detector.detect_group([0.5, 0.5, 0.5, 0.5])  # advantage_collapse
Pair this with the Model Passport so that a run showing reward hacking is
surfaced alongside contaminated_eval and the other risk flags the passport carries.
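Two of the signals above, test_tampering and proxy_divergence, can be approximated with very little code. This is a rough sketch under stated assumptions, not the RewardHackingDetector's implementation: the pattern table, the scan_for_tampering and proxy_diverges helpers, and the 0.5 gap threshold are all hypothetical.

```python
import re

# Illustrative patterns for code that sidesteps tests instead of passing them.
TAMPERING_PATTERNS = {
    "unittest.skip": re.compile(r"\bunittest\.skip\b"),
    "sys.exit(0)": re.compile(r"\bsys\.exit\(\s*0\s*\)"),
    "os._exit": re.compile(r"\bos\._exit\("),
}


def scan_for_tampering(prediction):
    """Return the names of the tampering patterns found in a model response."""
    return [name for name, pattern in TAMPERING_PATTERNS.items() if pattern.search(prediction)]


def proxy_diverges(proxy_reward, true_reward, gap=0.5):
    """Flag a trajectory whose proxy reward exceeds the true reward by at least `gap`."""
    return (proxy_reward - true_reward) >= gap


suspicious = scan_for_tampering("import sys; sys.exit(0)  # skip failing tests")
clean = scan_for_tampering("def add(a, b):\n    return a + b")
```

Pattern matching of this kind is easy to evade, so it is best read as a cheap first filter that complements the sandboxed checks described in the next section.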
4. Verifiable-reward environments (RLVR)
For RL with verifiable rewards, Provenir ships a library of sandboxed,
hack-resistant reward functions (provenir.environments) behind an
OpenEnv-compatible Environment protocol. Every verifier exposes
verify(response, reference) -> VerificationResult (with .passed, .reward,
.detail, .metadata):
| Verifier | Checks |
|---|---|
| ExactAnswerVerifier | Exact match against a gold answer (\boxed{}, ####, or raw) |
| MathVerifier | Numeric equivalence with tolerance |
| RegexFormatVerifier | Output matches a required format |
| JSONSchemaVerifier | Output parses as JSON with required keys/types |
| ToolCallVerifier | Valid tool call with an allowed tool + dict args |
| ContainsVerifier | Required substrings present, forbidden absent |
| CompositeVerifier | Weighted combination of verifiers |
| CodeVerifier | Runs code in a PythonSandbox (subprocess isolation + hacking detection) |
from provenir.environments import (
    MathVerifier,
    RegexFormatVerifier,
    CompositeVerifier,
)

verifier = CompositeVerifier([
    (MathVerifier(), 0.7),
    (RegexFormatVerifier(pattern=r"\\boxed\{.*\}", full_match=False), 0.3),
])

result = verifier.verify(model_output, gold_answer)  # (response, reference)
print(result.passed, result.reward, result.detail)
The CodeVerifier's PythonSandbox runs candidate solutions in an isolated
subprocess and inspects the generated code, so a model that tries to
unittest.skip or sys.exit(0) its way to a passing test is flagged
(metadata["suspected_hacking"]) rather than rewarded:
from provenir.environments import CodeVerifier
code_verifier = CodeVerifier()
result = code_verifier.verify(candidate_solution, {"test_code": unit_tests})
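The "numeric equivalence with tolerance" and "weighted combination" rows of the table can be illustrated with plain functions. The sketch below is an assumption about how such checks might be combined, not the library's code: math_check, format_check, and composite are hypothetical helpers, and normalising by the total weight is one choice among several.

```python
import math
import re


def math_check(response, reference, rel_tol=1e-6):
    """Score 1.0 if the last number in the response is close to the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    close = math.isclose(float(numbers[-1]), float(reference), rel_tol=rel_tol)
    return 1.0 if close else 0.0


def format_check(response, reference):
    """Score 1.0 if the response contains a \\boxed{...} answer; reference is unused."""
    return 1.0 if re.search(r"\\boxed\{.*\}", response) else 0.0


def composite(checks, response, reference):
    """Weighted average of (check, weight) pairs, normalised by the total weight."""
    total_weight = sum(weight for _, weight in checks)
    score = sum(weight * check(response, reference) for check, weight in checks)
    return score / total_weight


checks = [(math_check, 0.7), (format_check, 0.3)]
full_credit = composite(checks, r"The answer is \boxed{42}", "42")
partial_credit = composite(checks, "I think 42", "42")  # right number, wrong format
```

Splitting the reward this way gives partial credit for a correct answer in the wrong format, which is gentler than an all-or-nothing reward but also gives the policy a format-only signal to exploit, so the weights deserve care.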
Orchestrating GRPO / DAPO / GSPO
The RLOrchestrator (provenir.train.rl) fuses everything into one loop —
rollout -> verify -> reward -> flight recorder -> hacking detector -> eval
gate — and delegates the gradient step to a backend adapter. Pass a GRPOConfig,
DAPOConfig (decoupled clip + dynamic sampling), or GSPOConfig (sequence-level,
stabilises MoE):
from provenir.train.rl import RLOrchestrator
from provenir.train.algorithms import GRPOConfig
from provenir.environments import MathVerifier
from provenir.observability import FlightRecorder, RewardHackingDetector

orchestrator = RLOrchestrator(
    algorithm=GRPOConfig(group_size=8, max_steps=100),  # or DAPOConfig / GSPOConfig
    verifier=MathVerifier(),
    flight_recorder=FlightRecorder(),
    hacking_detector=RewardHackingDetector(),
)

result = orchestrator.run(train_ds, eval_ds)
print(result.steps_completed, result.mean_reward)
print(result.anomaly_count, result.hacking_rate, result.halted)
print(result.flight_summary)  # serialised flight-recorder state
Backend selection is automatic: provenir.train.backends.adapters wraps
verl / TRL / Unsloth with capability detection and a BackendSelector that
routes by scale tier. The provenir.train.rl_eval_gate.RLEvalGate fuses
contamination-safety + regression + reward-hacking into one loop guard that
halts a run before it wastes GPU budget.
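For readers new to GRPO, the group-relative advantage that the orchestrator's backends compute can be sketched in a few lines, which also shows why a group with identical rewards is degenerate. This is a generic illustration of the technique rather than Provenir's or any backend's code: group_advantages and is_degenerate are hypothetical names, and the population standard deviation is used here although implementations differ on that detail.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own sampling group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def is_degenerate(rewards, tol=1e-8):
    """A group with (near-)zero reward spread carries no learning signal."""
    return statistics.pstdev(rewards) <= tol


mixed = group_advantages([1.0, 0.0, 0.0, 1.0])      # successes get positive advantage
collapsed = group_advantages([0.5, 0.5, 0.5, 0.5])  # every advantage is zero
```

When every rollout in a group earns the same reward, all advantages are zero and the policy gradient for that prompt vanishes, which is the condition the advantage_collapse signal reports.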
On the CLI:
5. The contamination firewall + canary vaults
Train/eval contamination is the #1 eval-reliability pain: if eval examples leak
into training, your benchmark numbers are inflated and untrustworthy. The
contamination firewall (provenir.eval.contamination) detects overlap via
13-gram, embedding, or exact matching, with MinHash for scale.
from provenir.eval.contamination import ContaminationChecker, ContaminationConfig

checker = ContaminationChecker(ContaminationConfig(method="ngram"))
report = checker.check_datasets(train_dataset, eval_dataset)

print(f"{report.contamination_rate:.1%} contaminated across {len(report.hits)} hits")
for hit in report.hits:
    print(f"  train#{hit.train_index} <-> eval#{hit.eval_index} ({hit.method})")

clean_train = checker.filter_contaminated(train_dataset, report)
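The 13-gram method mentioned above amounts to checking whether any run of 13 consecutive tokens appears in both a training record and an eval record. Here is a minimal sketch of that idea using whitespace tokenisation; ngrams and find_overlaps are hypothetical helpers, not the ContaminationChecker's internals, and a real tokenizer or MinHash index would replace the exact set lookups at scale.

```python
def ngrams(text, n=13):
    """Set of n-token windows over lowercased, whitespace-split text."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty range and therefore no n-grams.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_overlaps(train_texts, eval_texts, n=13):
    """Return sorted (train_index, eval_index) pairs sharing at least one n-gram."""
    eval_index = {}
    for j, text in enumerate(eval_texts):
        for gram in ngrams(text, n):
            eval_index.setdefault(gram, set()).add(j)
    hits = set()
    for i, text in enumerate(train_texts):
        for gram in ngrams(text, n):
            for j in eval_index.get(gram, ()):
                hits.add((i, j))
    return sorted(hits)


eval_question = (
    "what is the sum of the first ten positive integers "
    "when added together in order"
)
train_texts = [
    "the capital of france is paris",
    "quiz: " + eval_question + " answer 55",  # leaked copy of the eval item
]
hits = find_overlaps(train_texts, [eval_question])
contamination_rate = len({i for i, _ in hits}) / len(train_texts)
```

Indexing the eval set once and streaming the training set through it keeps memory proportional to the smaller corpus, which is usually the eval side.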
On the CLI:
Canary vaults
A canary (provenir.eval.canary) is a unique token embedded into a private
eval set. If that token later shows up in training data, you know the held-out
set has leaked — the strongest signal that a benchmark result cannot be trusted.
from provenir.eval.canary import CanaryGuard

guard = CanaryGuard()
canary = guard.mint("mmlu-private")  # deterministic, reproducible token

# Embed the canary into each private-eval record before you distribute it:
tagged_eval = guard.tag(eval_dataset, canary)

# Later, scan any training corpus for the leaked canary:
leaked_rows = guard.scan(train_dataset, canary)
if leaked_rows:
    raise RuntimeError(f"Private eval leaked into training at rows {leaked_rows}")
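One way a deterministic canary could be minted, embedded, and scanned for is sketched below. This is an assumption for illustration only: mint_canary, tag_records, scan_records, the CANARY- prefix, the namespace string, and the "text" field are hypothetical and say nothing about how CanaryGuard actually derives or stores its tokens.

```python
import hashlib


def mint_canary(name, namespace="provenir-sketch"):
    """Derive a reproducible token from the eval-set name (hypothetical scheme)."""
    digest = hashlib.sha256(f"{namespace}:{name}".encode("utf-8")).hexdigest()
    return f"CANARY-{digest[:32]}"


def tag_records(records, canary, field="text"):
    """Return copies of the records with the canary appended to the text field."""
    return [{**record, field: f"{record[field]} {canary}"} for record in records]


def scan_records(records, canary, field="text"):
    """Return the indices of records whose text field contains the canary."""
    return [i for i, record in enumerate(records) if canary in record.get(field, "")]


canary = mint_canary("mmlu-private")
private_eval = tag_records([{"text": "Which planet is largest?"}], canary)
train_corpus = [
    {"text": "An unrelated training document."},
    {"text": private_eval[0]["text"]},  # simulated leak of a tagged eval record
]
leaked_rows = scan_records(train_corpus, canary)
```

A token derived only from a public name is guessable, so a scheme like this would normally mix in a secret salt; the sketch omits that to stay reproducible.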
Judge calibration
When you evaluate with an LLM judge, provenir.eval.judge_calibration measures
position bias, self-consistency, and flip-rate, and gives you two debiased
wrappers:
from provenir.eval.judge_calibration import (
    JudgeCalibrator,
    DebiasedJudge,
    EnsembleJudge,
)
from provenir.eval.judge import AnthropicJudge, OpenAIJudge, StubJudge

# Measure how reliable a judge is:
report = JudgeCalibrator(AnthropicJudge()).measure_position_bias(cases)
print(report.position_bias, report.flip_rate, report.is_reliable)

# Evaluate both orderings to remove position bias:
debiased = DebiasedJudge(AnthropicJudge())

# Majority vote across multiple judges:
ensemble = EnsembleJudge([AnthropicJudge(), OpenAIJudge(), StubJudge()])
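Position bias is usually measured by asking the judge the same comparison twice with the candidates swapped and counting how often the winner changes. The sketch below shows that measurement with toy judges; flip_rate, always_first, and longer_wins are hypothetical stand-ins, and the judge call signature is an assumption rather than the JudgeCalibrator interface.

```python
def flip_rate(judge, cases):
    """Fraction of cases where swapping the candidate order changes the winner.

    `judge(prompt, first, second)` is assumed to return "first" or "second".
    """
    if not cases:
        return 0.0
    flips = 0
    for prompt, answer_a, answer_b in cases:
        forward = judge(prompt, answer_a, answer_b)
        backward = judge(prompt, answer_b, answer_a)
        # Map each positional verdict back to the underlying answer.
        winner_forward = answer_a if forward == "first" else answer_b
        winner_backward = answer_b if backward == "first" else answer_a
        if winner_forward != winner_backward:
            flips += 1
    return flips / len(cases)


def always_first(prompt, first, second):
    """Toy judge with maximal position bias: the first slot always wins."""
    return "first"


def longer_wins(prompt, first, second):
    """Toy judge that ignores position and prefers the longer answer."""
    return "first" if len(first) >= len(second) else "second"


cases = [
    ("q1", "short", "a longer answer"),
    ("q2", "detailed reply", "ok"),
]
biased_rate = flip_rate(always_first, cases)
unbiased_rate = flip_rate(longer_wins, cases)
```

A debiasing wrapper can reuse the same two calls and only accept a verdict when both orderings agree, treating disagreements as ties.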
6. The Model Passport
The Model Passport (provenir.governance.passport, built on
provenir.governance.bom) is a signed, portable Bill-of-Materials recording
exactly what data, code, evals, and config produced a model. It is signed with
HMAC-SHA256, which makes it tamper-evident, and it carries compliance risk flags.
It is aimed at enterprise and regulatory adoption and maps to the record-keeping
requirements of EU AI Act Article 12 (tamper-evident audit trails + model
lineage, with obligations applying from Aug 2, 2026).
The easy way — via the wrapper
If you used provenir.track(..., sign_passport=True, signing_key=...), the
signed passport is already on run.passport and written to disk. Otherwise,
build one explicitly.
Build, sign, verify
from provenir.governance.bom import (
    ModelBOM,
    DataComponent,
    CodeComponent,
    EvalComponent,
)
from provenir.governance.passport import PassportSigner, PassportStore

bom = ModelBOM(
    model_id="my-model",
    base_model="Qwen2.5-7B",
    run_id=run.manifest.run_id,
    data=[DataComponent(
        name="train",
        content_hash=run.manifest.dataset_hash,
        num_records=1200,
        license="apache-2.0",
        pii_scanned=True,
        contamination_checked=True,
    )],
    code=CodeComponent(
        git_sha=run.manifest.git_sha,
        dependencies_hash=run.manifest.dependencies_lockfile,
        framework="trl",
    ),
    evals=[EvalComponent(benchmark="gsm8k", score=0.71)],
    hyperparameters={"lr": 1e-5},
)

signer = PassportSigner(key=b"team-signing-key", key_id="team")
passport = signer.sign(bom)

# Verify the HMAC-SHA256 signature; this fails if the BOM was tampered with:
assert passport.verify(b"team-signing-key")

# Persist (also appends an immutable audit-log line):
PassportStore("artifacts/passports").save(passport)
On the CLI:
provenir passport show passport.json # print the passport markdown
provenir passport verify passport.json --key team-signing-key
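The tamper-evidence property comes from HMAC-SHA256 over a canonical serialisation of the BOM. The sketch below shows the general technique with the Python standard library; canonical_bytes, sign_document, and verify_document are hypothetical helpers, and the canonical-JSON format is an assumption, not necessarily the byte layout PassportSigner uses.

```python
import hashlib
import hmac
import json


def canonical_bytes(document):
    """Serialise with sorted keys and fixed separators so equal dicts give equal bytes."""
    return json.dumps(document, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_document(document, key):
    """Return the hex HMAC-SHA256 tag of the canonical serialisation."""
    return hmac.new(key, canonical_bytes(document), hashlib.sha256).hexdigest()


def verify_document(document, signature, key):
    """Recompute the tag and compare in constant time."""
    expected = sign_document(document, key)
    return hmac.compare_digest(expected, signature)


key = b"team-signing-key"
bom_sketch = {"model_id": "my-model", "evals": [{"benchmark": "gsm8k", "score": 0.71}]}
signature = sign_document(bom_sketch, key)

# Changing any field after signing invalidates the signature.
tampered = {**bom_sketch, "evals": [{"benchmark": "gsm8k", "score": 0.95}]}
```

HMAC is symmetric: anyone who can verify a passport with the key can also forge one, so the key should be shared only with parties trusted to sign.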
Risk flags
The BOM derives compliance risk flags from the trust layer, so a downstream consumer can gate on them:
for flag in passport.bom.risk_flags():
    print(flag)

# unscanned_pii           -> a data component was not PII-scanned
# unchecked_contamination -> a data component was not contamination-checked
# contaminated_eval       -> an eval was flagged as contaminated
# unknown_license         -> a data component has no known license
A clean passport (no risk flags, valid signature) is a portable proof that a model was produced by a trustworthy, reproducible process.
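As an illustration of how a downstream consumer might gate on these flags, the sketch below derives the four flag names listed above from plain dictionaries and blocks a release on a chosen subset. The derive_risk_flags and release_allowed helpers, the dictionary field names (including "contaminated"), and the choice of blocking flags are all assumptions, not the ModelBOM implementation.

```python
def derive_risk_flags(data_components, eval_components):
    """Derive risk flags from component metadata (hypothetical field names)."""
    flags = []
    if any(not d.get("pii_scanned", False) for d in data_components):
        flags.append("unscanned_pii")
    if any(not d.get("contamination_checked", False) for d in data_components):
        flags.append("unchecked_contamination")
    if any(e.get("contaminated", False) for e in eval_components):
        flags.append("contaminated_eval")
    if any(not d.get("license") for d in data_components):
        flags.append("unknown_license")
    return flags


def release_allowed(flags, blocking=("contaminated_eval", "unscanned_pii")):
    """Allow a release only if none of the blocking flags are present."""
    return not any(flag in blocking for flag in flags)


clean_data = [{
    "name": "train", "pii_scanned": True,
    "contamination_checked": True, "license": "apache-2.0",
}]
risky_data = [{
    "name": "scraped", "pii_scanned": False,
    "contamination_checked": True, "license": None,
}]
clean_flags = derive_risk_flags(clean_data, [{"benchmark": "gsm8k", "score": 0.71}])
risky_flags = derive_risk_flags(
    risky_data, [{"benchmark": "gsm8k", "score": 0.71, "contaminated": True}]
)
```

Treating a missing field as "not scanned" or "not checked" is the conservative default: a component has to assert that it is clean before the flag clears.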
Deterministic replay + lineage DAG
Underpinning the passport is the deterministic replay subsystem
(provenir.provenance): a content-addressed environment fingerprint,
kernel-determinism flags, a lineage DAG (dataset -> run -> adapter -> eval ->
merge), and a ReplayEngine that verifies whether a run can be reproduced.
from provenir.provenance import (
    ReplayEngine,
    capture_fingerprint,
    kernel_determinism_flags,
)
from provenir.core.manifest import RunManifestStore

engine = ReplayEngine(RunManifestStore("artifacts/manifests"))

verification = engine.verify(
    run.manifest.run_id,
    current_config_hash=run.manifest.config_hash,
    current_dataset_hash=run.manifest.dataset_hash,
    current_fingerprint=capture_fingerprint(),
)
print(verification.reproducible, verification.differences)

# A full reproducibility recipe (seed, hashes, env flags, git SHA):
recipe = engine.replay_command(run.manifest.run_id)

print(kernel_determinism_flags())  # env vars for bitwise reproducibility
The lineage DAG lets you trace any artifact back to the exact dataset, code, and config that produced it — the model-lineage requirement behind modern AI governance regimes.
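Content addressing and replay verification both reduce to hashing a canonical description of the run and comparing it with what was recorded. The following is a minimal sketch of that comparison using only the standard library; capture_env_sketch, content_hash, and diff_records are hypothetical helpers, and the fields captured here are far fewer than a real environment fingerprint would need.

```python
import hashlib
import json
import platform
import sys


def capture_env_sketch():
    """Collect a few environment facts; a real fingerprint would cover far more."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
        "machine": platform.machine(),
    }


def content_hash(obj):
    """SHA-256 of a sorted-key JSON serialisation, so key order does not matter."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


def diff_records(recorded, current):
    """Return the sorted keys whose values differ between two records."""
    keys = set(recorded) | set(current)
    return sorted(k for k in keys if recorded.get(k) != current.get(k))


recorded = {"config_hash": content_hash({"lr": 1e-5, "seed": 0}), "python": "3.11.8"}
current = {"config_hash": content_hash({"seed": 0, "lr": 1e-5}), "python": "3.12.1"}
differences = diff_records(recorded, current)
reproducible = not differences
fingerprint = content_hash(capture_env_sketch())
```

Reporting which keys differ, rather than a single pass/fail bit, tells the user whether a failed replay is due to config drift, data drift, or an environment change.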
Putting it together
The trust layer is designed to compose: use verifiable-reward environments to
get a reward you cannot game, run the RLOrchestrator so the flight recorder,
hacking detector, and eval gate watch the whole loop, decontaminate against
your eval sets and canary-tag the private ones, then emit a signed Model
Passport as the portable, tamper-evident proof of what produced the model. Or,
for an existing loop, get all of it in three lines with
import provenir.