Data Flywheel
Automatically mine evaluation failures, generate high-quality augmented variants, and feed them back into the training set — without human labelling.
Overview
The DataFlywheel closes the loop between evaluation and training:
training set ──► model ──► eval predictions
│
score < threshold?
│
▼
mine failure records
│
▼
judge: generate variants
│
▼
quality filter
│
▼
augmented training set
Each cycle produces new training examples targeted exactly at the failure modes your current model exhibits — no manual annotation required.
Quick Start
from provenir.data.flywheel import DataFlywheel, FlywheelConfig
from provenir.eval.judge import AnthropicJudge
from provenir.data.dataset import JsonlDataset
flywheel = DataFlywheel(
config=FlywheelConfig(
min_score_threshold=0.7, # mine predictions scoring below this
max_variants_per_failure=3, # generate N variants per failure
quality_filter_threshold=0.5, # keep variants scoring above this
),
judge=AnthropicJudge(),
)
train_ds = JsonlDataset.from_jsonl("data/train.jsonl")
eval_ds = JsonlDataset.from_jsonl("data/eval.jsonl")
augmented = flywheel.run(train_ds, eval_ds)
augmented.save("data/train_augmented.jsonl")
print(f"Original: {len(train_ds.records)} records")
print(f"Augmented: {len(augmented.records)} records")
Configuration
FlywheelConfig
| Field | Default | Description |
|---|---|---|
min_score_threshold |
0.7 |
Eval predictions below this score are treated as failures |
max_variants_per_failure |
3 |
Number of augmented variants to generate per failure |
quality_filter_threshold |
0.5 |
Generated variants must score above this to be included |
Using with RLAIF
The flywheel pairs naturally with the RLAIF pipeline. Run the flywheel first to enrich the training set, then run RLAIF to fine-tune on both original and augmented data:
# Step 1: Augment
augmented = DataFlywheel(config=FlywheelConfig(), judge=judge).run(train_ds, eval_ds)
# Step 2: RLAIF on augmented set
pipeline = RLAIFPipeline(judge=judge, backend=backend, base_config=config)
iterations = pipeline.run(augmented, eval_ds)
Offline Mode (StubJudge)
For CI and testing, use StubJudge to run the flywheel without any API calls:
from provenir.eval.judge import StubJudge
flywheel = DataFlywheel(
config=FlywheelConfig(max_variants_per_failure=1),
judge=StubJudge(),
)
The stub judge generates deterministic synthetic variants — useful for verifying pipeline logic without incurring API costs.