New DAG (`ml/pipelines/bench_dag.py`) with three linked tasks:

1. collect.py - generates candidates, logs to MLflow
2. export_for_judge - exports pending runs for Claude Code scoring
3. compare - generates leaderboard by (model, prompt) cell

Config via dag_run.conf supports all collect.py options (models, prompts, n_tips, n_scenarios, temperature, experiment name, max_model_b).

New admin API endpoints (`services/api/src/routes/bench.ts`):

- GET /api/bench/experiments - list tip-bench-* experiments
- POST /api/bench/run - trigger DAG with custom config
- GET /api/bench/runs/:experiment - list runs in experiment
- GET /api/bench/leaderboard/:experiment - leaderboard by (model, prompt)

All endpoints require admin auth. Human judge (Claude Code) scores are applied manually post-export; future enhancement: add webhook to DAG. Admin UI can now trigger and monitor benchmarks from a dashboard panel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bench/ — combined model + prompt evaluation harness
Combines the work of issues #93 (model benchmark) and #95 (prompt
A/B) into one MLflow-tracked experiment. Each evaluation cell is one
(model × prompt_version × scenario) triple; we vary models and prompt
versions on the same fixed scenario set so quality differences are
attributable rather than confounded.
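To make the cell structure concrete, here is a minimal sketch of how the evaluation grid could be enumerated. The function name `build_cells` and the scenario ids are illustrative assumptions, not the harness's actual code.

```python
from itertools import product


def build_cells(models, prompt_versions, scenario_ids):
    """Return every (model, prompt_version, scenario_id) triple.

    Hypothetical helper: each model/prompt pair is crossed with the SAME
    scenario list, so score differences cannot come from different inputs.
    """
    return list(product(models, prompt_versions, scenario_ids))


models = ["qwen2.5:0.5b", "gemma3:1b"]
prompts = ["v1", "v2-mentor"]
scenarios = ["s0", "s1", "s2"]  # placeholder scenario ids (assumption)

cells = build_cells(models, prompts, scenarios)
print(len(cells))  # 2 models x 2 prompts x 3 scenarios
```

Because the scenario list is shared, any two (model, prompt) pairs can be compared scenario by scenario.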
Pieces
| File | Purpose |
|---|---|
| `rubric.md` | The scoring rubric (tip-v1). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic (persona × time-slot × tasks) contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local --allowed-hosts quirk and the file-only artifact backend. |
| `collect.py` | Phase A. Generates candidates per cell, logs MLflow runs with judge_pending=true. |
| `judge_cli.py` | Phase B. --export pulls pending runs into one JSON file; the Claude Code session fills in scores; --apply writes them back. |
| `compare.py` | Phase C. Leaderboard per (model, prompt) cell. |
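As an illustration of what "deterministic contexts" can mean in practice, the sketch below builds scenarios from a seeded, local random generator. The persona, time-slot, and task values, as well as the `build_scenarios` name, are made up for the example; the real `scenarios.py` may work differently.

```python
import random
from itertools import product

# All three pools are hypothetical placeholders.
PERSONAS = ["student", "freelancer"]
TIME_SLOTS = ["morning", "evening"]
TASK_POOL = ["write report", "gym", "email inbox", "call mom"]


def build_scenarios(n_tasks=2, seed=0):
    """Build one scenario per (persona, time slot) with a fixed task sample."""
    rng = random.Random(seed)  # local RNG: global random state is untouched
    scenarios = []
    for persona, slot in product(PERSONAS, TIME_SLOTS):
        tasks = rng.sample(TASK_POOL, n_tasks)
        scenarios.append({"persona": persona, "time_slot": slot, "tasks": tasks})
    return scenarios
```

Calling `build_scenarios()` twice with the same seed yields identical lists, which is the property that lets every cell see the same input.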
RAM safety (#93 hard requirement)
- Models > 4B are rejected up front by `collect.py --max-model-b 4.0`.
- Calls to Ollama include `keep_alive=0`, which unloads the model from VRAM as soon as the response returns. We never hold two LLM weights concurrently.
- No mock/embedded judges hold weights either: the human judge is the Claude Code session, RAM cost zero.
The pipeline can run end to end on a box with 15 GiB of RAM and 8 GiB of VRAM (1070-class GPU) without paging.
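The two guards above could look roughly like the following sketch. How `collect.py` actually determines model size is not documented here; parsing it from the tag suffix (`:3b`) is an assumption, and the helper names are hypothetical. The request body uses the `keep_alive` field of Ollama's `/api/generate` endpoint.

```python
import re

# Assumption: the size is encoded in the tag, e.g. "llama3.2:3b" -> 3.0
_SIZE_RE = re.compile(r":(\d+(?:\.\d+)?)b\b", re.IGNORECASE)


def model_size_b(tag):
    """Parse the parameter count (in billions) from an Ollama-style tag."""
    match = _SIZE_RE.search(tag)
    if match is None:
        raise ValueError(f"cannot infer size from model tag {tag!r}")
    return float(match.group(1))


def check_models(tags, max_model_b=4.0):
    """Reject the whole run up front if any model exceeds the size cap."""
    too_big = [t for t in tags if model_size_b(t) > max_model_b]
    if too_big:
        raise ValueError(f"models over {max_model_b}B: {too_big}")
    return list(tags)


def generate_payload(model, prompt):
    """JSON body for POST /api/generate; keep_alive=0 unloads after the reply."""
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}
```

Failing before any generation starts means an oversized model never gets loaded at all, rather than being caught midway through a grid run.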
Quick start
# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
--models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
--prompts v1,v2-mentor,v3-few-shot \
--experiment tip-bench-2026-04-27 \
--n-tips 5 \
--diversity
# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
--experiment tip-bench-2026-04-27 \
--export /tmp/oo-bench-judge.json
# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)
# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
--experiment tip-bench-2026-04-27 \
--apply /tmp/oo-bench-judge.json
# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
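Before step 4, it can help to sanity-check the filled-in file. The sketch below is only an illustration: the entry layout (`run_id` plus a `scores` dict), the 1 to 5 scale, and the unweighted-mean composite are all assumptions, since the real export schema and composite formula are defined by `judge_cli.py` and `rubric.md`.

```python
SCORE_KEYS = ("relevance", "actionability", "tone")


def validate_entry(entry, lo=1, hi=5):
    """Check one judged entry and return an illustrative composite score.

    Assumed layout: {"run_id": "...", "scores": {"relevance": 4, ...}}.
    """
    scores = entry.get("scores") or {}
    missing = [k for k in SCORE_KEYS if k not in scores]
    if missing:
        raise ValueError(f"run {entry.get('run_id')}: missing scores {missing}")
    for key in SCORE_KEYS:
        value = scores[key]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not lo <= value <= hi:
            raise ValueError(f"run {entry.get('run_id')}: bad {key}={value!r}")
    # Unweighted mean, for illustration only (assumption).
    return round(sum(scores[k] for k in SCORE_KEYS) / len(SCORE_KEYS), 3)
```

Rejecting incomplete entries before `--apply` avoids writing half-scored runs back to MLflow.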
Why the rubric matters
Different judging sessions need to be comparable. rubric.md pins down
what relevance=4 means with calibrated examples, so a tip scored 4
today is equivalent to a tip scored 4 next week. Without the rubric, the
"lazy human-in-the-loop" judge drifts.
Accessing results in MLflow
Each run's quality scores (relevance, actionability, tone, composite) are stored as metrics on the MLflow run, accessible via:
- MLflow UI: experiment `tip-bench-2026-04-27` → click any run → Metrics section
- Leaderboard: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
- Raw API: `mlflow_client.search_runs()` filters and pulls metrics in bulk
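For readers working from the raw API, the sketch below aggregates runs into a per-cell leaderboard. It assumes the run dictionaries follow the shape returned by MLflow's REST `runs/search` endpoint (metrics and params as key/value lists) and that the model and prompt are logged as params named `model` and `prompt_version`; those two names are guesses, and `compare.py` may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs, metric="composite"):
    """Average one metric per (model, prompt_version) cell, best first."""
    cells = defaultdict(list)
    for run in runs:
        data = run.get("data", {})
        params = {p["key"]: p["value"] for p in data.get("params", [])}
        metrics = {m["key"]: m["value"] for m in data.get("metrics", [])}
        if metric not in metrics:
            continue  # run has not been scored yet
        # "model" / "prompt_version" are assumed param names.
        cell = (params.get("model"), params.get("prompt_version"))
        cells[cell].append(metrics[metric])
    rows = [(m, p, mean(vals), len(vals)) for (m, p), vals in cells.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Each row is `(model, prompt_version, mean_score, n_runs)`; keeping the run count next to the mean makes it easier to spot cells with too few scored runs.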
Candidate tips, prompts, and raw responses are stored as tags with
keys artifact:candidates.json, artifact:prompt.txt, and artifact:raw.txt.
Tags are a fallback: the MLflow server uses a file:// artifact backend
that is not accessible via REST from the host.
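Reading those tag-backed artifacts back out of a run could be sketched as follows. The run shape again follows MLflow's REST `runs/search` response (tags as key/value lists); the helper name is hypothetical, and treating the `candidates.json` tag value as JSON text is an assumption.

```python
import json

ARTIFACT_PREFIX = "artifact:"


def artifacts_from_tags(run):
    """Collect tags named 'artifact:<name>' into a {name: text} dict."""
    tags = run.get("data", {}).get("tags", [])
    return {
        tag["key"][len(ARTIFACT_PREFIX):]: tag["value"]
        for tag in tags
        if tag["key"].startswith(ARTIFACT_PREFIX)
    }


example_run = {
    "data": {
        "tags": [
            {"key": "artifact:candidates.json", "value": '["tip one", "tip two"]'},
            {"key": "artifact:prompt.txt", "value": "You are a mentor."},
            {"key": "judge_pending", "value": "true"},
        ]
    }
}

artifacts = artifacts_from_tags(example_run)
candidates = json.loads(artifacts["candidates.json"])  # assumed to be JSON text
```

Note that MLflow limits tag value size, so this fallback suits short text payloads rather than large files.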
Integrating with Airflow (#95)
A future DAG ml/pipelines/prompt_ab_eval.py will wrap collect.py
exactly as shown in the quick-start, triggered on-demand from the admin
UI or manually. The results feed into the admin leaderboard view.
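One way a DAG task could wrap `collect.py` is to translate a trigger config into the same command line used in the quick start. The sketch below is a plain-Python illustration with no Airflow dependency; the config key names and the `collect_argv` helper are assumptions, and only flags that appear in the quick start or the RAM safety section are mapped.

```python
import sys

# Config key -> collect.py flag. Key names are hypothetical.
FLAG_FOR_KEY = {
    "models": "--models",
    "prompts": "--prompts",
    "experiment": "--experiment",
    "n_tips": "--n-tips",
    "max_model_b": "--max-model-b",
}


def collect_argv(conf, script="ml/experiments/bench/collect.py"):
    """Build the argv list for collect.py from a trigger config dict."""
    unknown = set(conf) - set(FLAG_FOR_KEY) - {"diversity"}
    if unknown:
        raise ValueError(f"unsupported config keys: {sorted(unknown)}")
    argv = [sys.executable, script]
    for key, flag in FLAG_FOR_KEY.items():
        if key not in conf:
            continue
        value = conf[key]
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)  # CLI takes comma lists
        argv += [flag, str(value)]
    if conf.get("diversity"):
        argv.append("--diversity")  # boolean switch, no value
    return argv
```

Returning a list rather than a shell string lets the caller pass it straight to `subprocess.run` without quoting issues.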
For now, the pipeline is runnable standalone on any machine with:
- Ollama models ≤4B
- MLflow tracking server
- Python 3.10+