Skip to content

Core Concepts

This page explains the key abstractions in Provenir and how they fit together.


RunManifest — Reproducibility by Default

Every training run in Provenir produces a RunManifest: a content-addressed, tamper-evident record that captures:

Field What it captures
run_id UUID for this specific run
config_hash SHA-256 of the serialised RunConfig
dataset_hash SHA-256 of the training dataset
git_sha Current HEAD commit (if in a git repo)
seed Random seed used
timestamp ISO-8601 start time
provenance Chain of parent run IDs (for RLAIF iterations)

Any run can be reproduced exactly:

provenir reproduce artifacts/manifests/<run_id>.json

Provenir will verify the config hash and dataset hash before starting and raise an error if either has changed.

The manifest is stored as a JSON file and surfaced via the REST API at GET /manifests/{run_id}.


RunConfig — Unified Training Configuration

RunConfig is a Pydantic model that covers every aspect of a training job:

name: my-run
backend: trl          # trl | stub
seed: 42
model_name_or_path: meta-llama/Llama-3.2-1B
max_steps: 1000
batch_size: 8

peft:
  rank: 16
  alpha: 32
  target_modules: ["q_proj", "v_proj"]
  load_in_4bit: false   # QLoRA

distributed:
  strategy: fsdp        # fsdp | deepspeed | ddp | none
  num_gpus: 8

observability_backend: wandb

Load from YAML:

from provenir.core.config import RunConfig

config = RunConfig.from_yaml("my_run.yaml")

Or construct in Python:

from provenir.core.config import RunConfig, PEFTConfig

config = RunConfig(
    name="my-run",
    model_name_or_path="meta-llama/Llama-3.2-1B",
    peft=PEFTConfig(rank=16, alpha=32),
)

Training Backends

Backends implement the TrainingBackend protocol and are selected by the backend field in RunConfig.

Backend When to use
trl Production training — SFT, DPO, LoRA, QLoRA via HuggingFace TRL
stub Testing and CI — returns immediately with a fake manifest

The TRL backend supports:

  • SFT (Supervised Fine-Tuning) — default algorithm
  • DPO (Direct Preference Optimization) — for preference datasets
  • LoRA — parameter-efficient fine-tuning via peft config
  • QLoRA — 4-bit or 8-bit quantized LoRA
  • rsLoRA — rank-stabilised LoRA scaling (use_rslora: true)

Prompt Templates

Six built-in formats, switchable at runtime:

Template Chat format Use case
alpaca ### Instruction: … ### Response: Classic instruction tuning
chatml <|im_start|>user … <|im_end|> Multi-turn, OpenAI-compatible
llama3 <|begin_of_text|> … <|eot_id|> Llama 3 official format
mistral [INST] … [/INST] Mistral instruction format
phi3 <|user|> … <|assistant|> Microsoft Phi-3
raw_completion {prompt}{response} Plain completion, no special tokens
from provenir.data.templates import TEMPLATE_REGISTRY

# List available templates
names = TEMPLATE_REGISTRY.list_names()

# Format a record
text = TEMPLATE_REGISTRY.format("llama3", {
    "system": "You are a helpful assistant.",
    "prompt": "What is LoRA?",
    "response": "LoRA is a parameter-efficient fine-tuning method...",
})

Evaluation Layer

Metrics

All metrics implement the MetricFn protocol:

Metric Description
ExactMatch Binary string equality
TokenF1 Token-level F1 score
BLEU4 BLEU-4 (4-gram overlap)
ROUGE_L Longest common subsequence recall

All results include Wilson 95% confidence intervals — not just the point estimate. This catches "improvements" that are just noise.

from provenir.eval.harness import MultiMetricEvaluator

result = MultiMetricEvaluator().evaluate(dataset, predictions)
em = result.metrics["exact_match"]
print(f"{em.mean:.3f}  [{em.ci_lower:.3f}, {em.ci_upper:.3f}]")

Regression Gate

The RegressionGate blocks promotion if eval scores drop below the baseline by more than a configured tolerance:

from provenir.eval.regression import RegressionGate

gate = RegressionGate(tolerance=0.02)   # allow up to 2% regression
gate.check(new_result, baseline_result)  # raises RegressionError if gate trips

EvalCallback

EvalCallback hooks into training to run evaluation at configurable intervals and optionally stop training early if quality stops improving:

from provenir.train.eval_callback import EvalCallback, EvalCallbackConfig

callback = EvalCallback(
    config=EvalCallbackConfig(
        eval_every_n_steps=100,
        early_stopping_patience=3,
        regression_tolerance=0.02,
    ),
    evaluator=MultiMetricEvaluator(),
    eval_dataset=eval_ds,
)

RAG Metrics

For retrieval-augmented generation evaluation:

Metric What it measures
faithfulness Fraction of answer tokens grounded in context
context_precision Overlap between context and question
answer_relevance Fraction of question tokens covered by answer
from provenir.eval.rag_metrics import RAGEvaluator

results = RAGEvaluator().evaluate([{
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "context": "France is a country in Western Europe. Its capital is Paris.",
}])

LLM-as-Judge

Provenir ships four judge implementations:

Judge Use case
StubJudge Deterministic, no API calls — for CI and offline testing
CachedJudge SHA-256 disk cache wrapping any other judge
AnthropicJudge Pairwise and rubric scoring via Claude
OpenAIJudge Pairwise and rubric scoring via GPT-4o

All judges implement the LLMJudge protocol:

from provenir.eval.judge import AnthropicJudge, CachedJudge

judge = CachedJudge(AnthropicJudge(model="claude-haiku-4-5-20251001"))

# Pairwise: which response is better?
pref = judge.score_pairwise("What is LoRA?", response_a, response_b)
# → Preference(preferred='a', confidence=0.91, rationale='Response A is more precise...')

# Rubric: score against specific criteria
scores = judge.score_rubric("What is LoRA?", response, criteria=[
    "factual accuracy", "clarity", "conciseness"
])

RLAIF Pipeline

Provenir's unique differentiator: an end-to-end loop that generates preference data, trains via DPO, evaluates, and iterates — without any human labelling.

dataset
generate N response variants per prompt
LLM judge: pairwise ranking of variants
build (chosen, rejected) preference pairs
DPO training on preference pairs
automatic evaluation against held-out set
regression gate (stop if quality regresses)
next iteration (up to n_iterations)

See the RLAIF guide for full configuration and usage.


Data Flywheel

The DataFlywheel automatically mines prediction failures, generates augmented variants via the judge, filters by quality, and augments the training set:

eval failures  ──►  judge: generate variants  ──►  quality filter  ──►  augmented dataset

See the Data Flywheel guide.


Governance

Audit Log

Every significant event in Provenir is logged to an append-only JSONL file:

from provenir.governance.audit import AuditLogger

audit = AuditLogger("artifacts/audit")
audit.log("training_complete", actor="ci-bot", run_id=manifest.run_id, steps=500)

The audit log is exposed via GET /audit in the REST API and provenir audit in the CLI.

PII & Secret Scanning

from provenir.governance.pii import PIIScanner, PIIMasker
from provenir.governance.scanners import SecretScanner

# Scan before training
pii_report = PIIScanner().scan(dataset)
secret_report = SecretScanner().scan(dataset)

if pii_report.has_pii:
    dataset = PIIMasker().mask(dataset)

Model Cards

from provenir.governance.model_card import ModelCardGenerator

card = ModelCardGenerator().generate(manifest, eval_result)
card.save("MODEL_CARD.md")

Or via CLI:

provenir model-card --manifest artifacts/manifests/<run_id>.json

Plugin Architecture

New backends, metrics, judges, and reward functions are registered via protocol interfaces — no subclassing required:

from provenir.plugins.registry import PluginRegistry
from provenir.core.abstractions import TrainingBackend

@PluginRegistry.register("my-backend")
class MyBackend(TrainingBackend):
    def fit(self, config, dataset):
        ...

Reward Functions

Reward primitives for RLHF/GRPO training:

Reward Description
ExactMatchReward Binary 1.0 if prediction matches reference
FormatReward Checks structural constraints (JSON, code block, etc.)
WeightedSumReward Linear combination of multiple rewards
MinReward Returns the minimum across multiple rewards
MaxReward Returns the maximum across multiple rewards
ThresholdGatedReward Returns 0 if any component falls below a threshold
ClampedReward Clamps the reward to a specified range
from provenir.rewards.primitives import WeightedSumReward, ExactMatchReward, FormatReward

reward = WeightedSumReward(rewards=[
    (ExactMatchReward(), 0.7),
    (FormatReward(pattern=r"^\[.*\]$"), 0.3),
])
score = reward.score(prediction, reference)