Core Concepts
This page explains the key abstractions in Provenir and how they fit together.
RunManifest — Reproducibility by Default
Every training run in Provenir produces a RunManifest: a content-addressed,
tamper-evident record that captures:
| Field | What it captures |
|---|---|
run_id |
UUID for this specific run |
config_hash |
SHA-256 of the serialised RunConfig |
dataset_hash |
SHA-256 of the training dataset |
git_sha |
Current HEAD commit (if in a git repo) |
seed |
Random seed used |
timestamp |
ISO-8601 start time |
provenance |
Chain of parent run IDs (for RLAIF iterations) |
Any run can be reproduced exactly:
Provenir will verify the config hash and dataset hash before starting and raise an error if either has changed.
The manifest is stored as a JSON file and surfaced via the REST API at
GET /manifests/{run_id}.
RunConfig — Unified Training Configuration
RunConfig is a Pydantic model that covers every aspect of a training job:
name: my-run
backend: trl # trl | stub
seed: 42
model_name_or_path: meta-llama/Llama-3.2-1B
max_steps: 1000
batch_size: 8
peft:
rank: 16
alpha: 32
target_modules: ["q_proj", "v_proj"]
load_in_4bit: false # QLoRA
distributed:
strategy: fsdp # fsdp | deepspeed | ddp | none
num_gpus: 8
observability_backend: wandb
Load from YAML:
Or construct in Python:
from provenir.core.config import RunConfig, PEFTConfig
config = RunConfig(
name="my-run",
model_name_or_path="meta-llama/Llama-3.2-1B",
peft=PEFTConfig(rank=16, alpha=32),
)
Training Backends
Backends implement the TrainingBackend protocol and are selected by the
backend field in RunConfig.
| Backend | When to use |
|---|---|
trl |
Production training — SFT, DPO, LoRA, QLoRA via HuggingFace TRL |
stub |
Testing and CI — returns immediately with a fake manifest |
The TRL backend supports:
- SFT (Supervised Fine-Tuning) — default algorithm
- DPO (Direct Preference Optimization) — for preference datasets
- LoRA — parameter-efficient fine-tuning via
peftconfig - QLoRA — 4-bit or 8-bit quantized LoRA
- rsLoRA — rank-stabilised LoRA scaling (
use_rslora: true)
Prompt Templates
Six built-in formats, switchable at runtime:
| Template | Chat format | Use case |
|---|---|---|
alpaca |
### Instruction: … ### Response: |
Classic instruction tuning |
chatml |
<|im_start|>user … <|im_end|> |
Multi-turn, OpenAI-compatible |
llama3 |
<|begin_of_text|> … <|eot_id|> |
Llama 3 official format |
mistral |
[INST] … [/INST] |
Mistral instruction format |
phi3 |
<|user|> … <|assistant|> |
Microsoft Phi-3 |
raw_completion |
{prompt}{response} |
Plain completion, no special tokens |
from provenir.data.templates import TEMPLATE_REGISTRY
# List available templates
names = TEMPLATE_REGISTRY.list_names()
# Format a record
text = TEMPLATE_REGISTRY.format("llama3", {
"system": "You are a helpful assistant.",
"prompt": "What is LoRA?",
"response": "LoRA is a parameter-efficient fine-tuning method...",
})
Evaluation Layer
Metrics
All metrics implement the MetricFn protocol:
| Metric | Description |
|---|---|
ExactMatch |
Binary string equality |
TokenF1 |
Token-level F1 score |
BLEU4 |
BLEU-4 (4-gram overlap) |
ROUGE_L |
Longest common subsequence recall |
All results include Wilson 95% confidence intervals — not just the point estimate. This catches "improvements" that are just noise.
from provenir.eval.harness import MultiMetricEvaluator
result = MultiMetricEvaluator().evaluate(dataset, predictions)
em = result.metrics["exact_match"]
print(f"{em.mean:.3f} [{em.ci_lower:.3f}, {em.ci_upper:.3f}]")
Regression Gate
The RegressionGate blocks promotion if eval scores drop below the baseline
by more than a configured tolerance:
from provenir.eval.regression import RegressionGate
gate = RegressionGate(tolerance=0.02) # allow up to 2% regression
gate.check(new_result, baseline_result) # raises RegressionError if gate trips
EvalCallback
EvalCallback hooks into training to run evaluation at configurable intervals
and optionally stop training early if quality stops improving:
from provenir.train.eval_callback import EvalCallback, EvalCallbackConfig
callback = EvalCallback(
config=EvalCallbackConfig(
eval_every_n_steps=100,
early_stopping_patience=3,
regression_tolerance=0.02,
),
evaluator=MultiMetricEvaluator(),
eval_dataset=eval_ds,
)
RAG Metrics
For retrieval-augmented generation evaluation:
| Metric | What it measures |
|---|---|
faithfulness |
Fraction of answer tokens grounded in context |
context_precision |
Overlap between context and question |
answer_relevance |
Fraction of question tokens covered by answer |
from provenir.eval.rag_metrics import RAGEvaluator
results = RAGEvaluator().evaluate([{
"question": "What is the capital of France?",
"answer": "Paris is the capital of France.",
"context": "France is a country in Western Europe. Its capital is Paris.",
}])
LLM-as-Judge
Provenir ships four judge implementations:
| Judge | Use case |
|---|---|
StubJudge |
Deterministic, no API calls — for CI and offline testing |
CachedJudge |
SHA-256 disk cache wrapping any other judge |
AnthropicJudge |
Pairwise and rubric scoring via Claude |
OpenAIJudge |
Pairwise and rubric scoring via GPT-4o |
All judges implement the LLMJudge protocol:
from provenir.eval.judge import AnthropicJudge, CachedJudge
judge = CachedJudge(AnthropicJudge(model="claude-haiku-4-5-20251001"))
# Pairwise: which response is better?
pref = judge.score_pairwise("What is LoRA?", response_a, response_b)
# → Preference(preferred='a', confidence=0.91, rationale='Response A is more precise...')
# Rubric: score against specific criteria
scores = judge.score_rubric("What is LoRA?", response, criteria=[
"factual accuracy", "clarity", "conciseness"
])
RLAIF Pipeline
Provenir's unique differentiator: an end-to-end loop that generates preference data, trains via DPO, evaluates, and iterates — without any human labelling.
dataset
│
▼
generate N response variants per prompt
│
▼
LLM judge: pairwise ranking of variants
│
▼
build (chosen, rejected) preference pairs
│
▼
DPO training on preference pairs
│
▼
automatic evaluation against held-out set
│
▼
regression gate (stop if quality regresses)
│
▼
next iteration (up to n_iterations)
See the RLAIF guide for full configuration and usage.
Data Flywheel
The DataFlywheel automatically mines prediction failures, generates augmented
variants via the judge, filters by quality, and augments the training set:
See the Data Flywheel guide.
Governance
Audit Log
Every significant event in Provenir is logged to an append-only JSONL file:
from provenir.governance.audit import AuditLogger
audit = AuditLogger("artifacts/audit")
audit.log("training_complete", actor="ci-bot", run_id=manifest.run_id, steps=500)
The audit log is exposed via GET /audit in the REST API and provenir audit
in the CLI.
PII & Secret Scanning
from provenir.governance.pii import PIIScanner, PIIMasker
from provenir.governance.scanners import SecretScanner
# Scan before training
pii_report = PIIScanner().scan(dataset)
secret_report = SecretScanner().scan(dataset)
if pii_report.has_pii:
dataset = PIIMasker().mask(dataset)
Model Cards
from provenir.governance.model_card import ModelCardGenerator
card = ModelCardGenerator().generate(manifest, eval_result)
card.save("MODEL_CARD.md")
Or via CLI:
Plugin Architecture
New backends, metrics, judges, and reward functions are registered via protocol interfaces — no subclassing required:
from provenir.plugins.registry import PluginRegistry
from provenir.core.abstractions import TrainingBackend
@PluginRegistry.register("my-backend")
class MyBackend(TrainingBackend):
def fit(self, config, dataset):
...
Reward Functions
Reward primitives for RLHF/GRPO training:
| Reward | Description |
|---|---|
ExactMatchReward |
Binary 1.0 if prediction matches reference |
FormatReward |
Checks structural constraints (JSON, code block, etc.) |
WeightedSumReward |
Linear combination of multiple rewards |
MinReward |
Returns the minimum across multiple rewards |
MaxReward |
Returns the maximum across multiple rewards |
ThresholdGatedReward |
Returns 0 if any component falls below a threshold |
ClampedReward |
Clamps the reward to a specified range |