Benchmark Evaluation

Evaluate any model or fine-tuned adapter against industry-standard benchmarks using the EleutherAI lm-evaluation-harness.

Supported Benchmarks

Benchmark	Domain	Default shots	Measures
`mmlu`	Knowledge (57 subjects)	5-shot	Factual knowledge breadth
`hellaswag`	Commonsense reasoning	10-shot	Situational language understanding
`arc_easy`	Science QA	0-shot	Elementary science knowledge
`arc_challenge`	Science QA (hard)	25-shot	Advanced science reasoning
`truthfulqa_mc1`	Truthfulness	0-shot	Avoiding false beliefs
`truthfulqa_mc2`	Truthfulness	0-shot	Calibrated truthfulness
`winogrande`	Commonsense NLI	5-shot	Pronoun disambiguation
`gsm8k`	Grade-school math	8-shot	Multi-step arithmetic
`humaneval`	Code generation	0-shot	Python programming

Quick Start

CLI

# Install benchmark deps
pip install "provenir[benchmarks]"

# Run a single benchmark
provenir benchmark \
  --model-path ./my-adapter \
  --benchmarks mmlu

# Run a suite
provenir benchmark \
  --model-path ./my-adapter \
  --benchmarks mmlu hellaswag arc_easy arc_challenge gsm8k

Python

from provenir.eval.benchmarks import BenchmarkConfig, BenchmarkEvaluator
from pathlib import Path

evaluator = BenchmarkEvaluator()

results = evaluator.run_suite(
    model_path=Path("./my-adapter"),
    configs=[
        BenchmarkConfig(benchmark="mmlu", num_fewshot=5),
        BenchmarkConfig(benchmark="hellaswag", num_fewshot=10),
        BenchmarkConfig(benchmark="arc_challenge", num_fewshot=25),
        BenchmarkConfig(benchmark="gsm8k", num_fewshot=8),
    ],
)

for r in results:
    print(f"{r.benchmark:20s}  {r.score:.3f}  ({r.num_examples} examples)")

Output:

mmlu                  0.614  (14042 examples)
hellaswag             0.791  (10042 examples)
arc_challenge         0.531  (1172 examples)
gsm8k                 0.482  (1319 examples)

Configuration

BenchmarkConfig

Field	Default	Description
`benchmark`	required	Benchmark name (see table above)
`num_fewshot`	`0`	Number of in-context examples
`limit`	`None`	Limit number of evaluation examples (for fast testing)

Comparing Before and After Fine-Tuning

base_results    = evaluator.run_suite("base-model/",    configs)
tuned_results   = evaluator.run_suite("tuned-adapter/", configs)

for base, tuned in zip(base_results, tuned_results):
    delta = tuned.score - base.score
    sign  = "+" if delta >= 0 else ""
    print(f"{base.benchmark:20s}  {base.score:.3f} → {tuned.score:.3f}  ({sign}{delta:.3f})")

Using with Manifests

# Save results alongside the run manifest
provenir benchmark \
  --model-path ./my-adapter \
  --benchmarks mmlu hellaswag \
  --manifest artifacts/manifests/<run_id>.json

Benchmark results are appended to the manifest and visible via GET /manifests/{run_id} in the REST API.

Requirements

pip install "provenir[benchmarks]"

Installs lm-eval ≥ 0.4.0 and its dependencies. When lm-eval is not installed, BenchmarkEvaluator.run() raises ImportError with a clear install instruction.