Combines model evaluation (#93) and prompt A/B testing (#95) into one experiment. Evaluates all (model × prompt × scenario) cells on the same fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell, logs to MLflow with judge_pending=true. Rejects models >4B, uses keep_alive=0 for RAM safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone, plus format_ok and overlong flags. Composite = rel + act + tone + 2×format_ok − overlong. Rubric is self-describing and persisted in every run so judges use consistent criteria across sessions.

Artifacts (prompts, candidates, raw responses) stored as MLflow tags because the server uses a file:// backend not accessible via REST. Full artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75). Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bench/ — combined model + prompt evaluation harness
Combines the work of issues #93 (model benchmark) and #95 (prompt
A/B) into one MLflow-tracked experiment. Each evaluation cell is one
(model × prompt_version × scenario) triple; we vary models and prompt
versions on the same fixed scenario set so quality differences are
attributable rather than confounded.
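
The grid of cells can be pictured as a plain Cartesian product. The sketch below is illustrative only: the model and prompt names are the ones used in the quick start, while the scenario ids and the `build_cells` helper are hypothetical placeholders (the real scenarios live in scenarios.py).

```python
from itertools import product

# Model and prompt names come from the quick start; the scenario ids are
# placeholders, since the real contexts are defined in scenarios.py.
MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
PROMPTS = ["v1", "v2-mentor", "v3-few-shot"]
SCENARIOS = ["scenario-1", "scenario-2", "scenario-3", "scenario-4"]


def build_cells(models, prompts, scenarios):
    """Return every (model, prompt_version, scenario) triple in a stable order."""
    return list(product(models, prompts, scenarios))


cells = build_cells(MODELS, PROMPTS, SCENARIOS)
print(len(cells))  # 4 models x 3 prompts x 4 scenarios = 48
```

Because every model and prompt version sees the identical scenario list, any two cells that share a scenario differ only in the factor under test.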
Pieces
| File | Purpose |
|---|---|
| `rubric.md` | The scoring rubric (tip-v1). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic (persona × time-slot × tasks) contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local `--allowed-hosts` quirk and the file-only artifact backend. |
| `collect.py` | Phase A. Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py` | Phase B. `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py` | Phase C. Leaderboard per (model, prompt) cell. |
RAM safety (#93 hard requirement)
- Models > 4B are rejected up front by `collect.py --max-model-b 4.0`.
- Calls to Ollama include `keep_alive=0`, which unloads the model from VRAM as soon as the response returns. We never hold the weights of two LLMs concurrently.
- No mock/embedded judges hold weights either: the human judge is the Claude Code session, so its RAM cost is zero.
The pipeline can run end to end, without paging, on a box with 15 GiB of RAM and 8 GiB of VRAM (1070-class GPU).
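
The two guards above can be sketched as follows. This is a minimal illustration, not the code in collect.py: the helper names are hypothetical, and reading the parameter count from the model tag is an assumption about how the size check might work. The `keep_alive` field itself is a documented option of Ollama's `/api/generate` endpoint.

```python
import re

MAX_MODEL_B = 4.0  # mirrors the --max-model-b 4.0 limit described above


def parse_size_b(model_tag):
    """Extract the parameter count in billions from a tag like 'qwen2.5:1.5b'.

    Returns None when the tag carries no explicit size (e.g. 'name:latest').
    """
    match = re.search(r":(\d+(?:\.\d+)?)b\b", model_tag.lower())
    return float(match.group(1)) if match else None


def is_allowed(model_tag, max_b=MAX_MODEL_B):
    """Reject models above the limit, and any tag whose size cannot be read."""
    size = parse_size_b(model_tag)
    return size is not None and size <= max_b


def build_generate_payload(model_tag, prompt):
    """Body for Ollama's POST /api/generate; keep_alive=0 unloads after the reply."""
    return {"model": model_tag, "prompt": prompt, "stream": False, "keep_alive": 0}
```

Treating an unreadable size as a rejection is a deliberately conservative choice for this sketch: on a memory-constrained box it is safer to refuse an unknown model than to load it.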
Quick start
    # 1. Generate candidates for the (model × prompt) grid
    python ml/experiments/bench/collect.py \
      --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
      --prompts v1,v2-mentor,v3-few-shot \
      --experiment tip-bench-2026-04-27 \
      --n-tips 5 \
      --diversity

    # 2. Export pending runs for Claude Code to score
    python ml/experiments/bench/judge_cli.py \
      --experiment tip-bench-2026-04-27 \
      --export /tmp/oo-bench-judge.json

    # 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)

    # 4. Push scores back to MLflow
    python ml/experiments/bench/judge_cli.py \
      --experiment tip-bench-2026-04-27 \
      --apply /tmp/oo-bench-judge.json

    # 5. Leaderboard
    python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
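
Step 5 ranks (model, prompt) cells. A minimal sketch of that aggregation follows, assuming each cell's score is the mean composite over its scenarios; compare.py may aggregate differently. The record shape and the `leaderboard` function are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Rank (model, prompt) cells by mean composite across scenarios.

    `runs` is a list of dicts with keys 'model', 'prompt', 'composite';
    this record shape is hypothetical, not compare.py's actual input.
    """
    by_cell = defaultdict(list)
    for run in runs:
        by_cell[(run["model"], run["prompt"])].append(run["composite"])
    rows = [(model, prompt, mean(scores), len(scores))
            for (model, prompt), scores in by_cell.items()]
    # Highest mean composite first; ties broken by name for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))


# Toy scores for illustration only; these are not benchmark results.
toy_runs = [
    {"model": "model-a", "prompt": "p1", "composite": 12},
    {"model": "model-a", "prompt": "p1", "composite": 14},
    {"model": "model-b", "prompt": "p1", "composite": 10},
    {"model": "model-b", "prompt": "p1", "composite": 11},
]
for model, prompt, score, n in leaderboard(toy_runs):
    print(f"{model} x {prompt}: mean composite {score} over {n} runs")
```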
Why the rubric matters
Different judging sessions need to be comparable. rubric.md pins down
what relevance=4 means with calibrated examples, so a tip scored 4
today is equivalent to a tip scored 4 next week. Without the rubric, the
"lazy human-in-the-loop" judge drifts.
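
For reference, the tip-v1 composite (rel + act + tone + 2×format_ok − overlong) can be computed as in the sketch below. The function name and the range check are illustrative assumptions; the formula and the 1–5 scales are the ones stated for the rubric.

```python
def composite(relevance, actionability, tone, format_ok, overlong):
    """Composite = rel + act + tone + 2*format_ok - overlong.

    The three quality scores use the rubric's 1-5 scale; format_ok and
    overlong are boolean flags.
    """
    for name, value in (("relevance", relevance),
                        ("actionability", actionability),
                        ("tone", tone)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be on the 1-5 scale, got {value}")
    return (relevance + actionability + tone
            + 2 * int(bool(format_ok)) - int(bool(overlong)))


print(composite(4, 4, 3, format_ok=True, overlong=True))  # 4 + 4 + 3 + 2 - 1 = 12
```

Under this formula a single judged tip scores between 2 and 17, so a well-formatted tip gains as much from `format_ok` as from two rubric points elsewhere.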
Accessing results in MLflow
Each run's quality scores (relevance, actionability, tone, composite) are stored as metrics on the MLflow run, accessible via:
- MLflow UI: experiment `tip-bench-2026-04-27` → click any run → Metrics section
- Leaderboard: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
- Raw API: `mlflow_client.search_runs()` filters and pulls metrics in bulk
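
For the raw API route, the sketch below shows what a bulk pull looks like against MLflow's REST endpoint `POST /api/2.0/mlflow/runs/search`. It uses only the standard library rather than httpx, and the function names and signatures are hypothetical; they are not the interface of this repository's mlflow_client.py.

```python
import json
import urllib.request


def build_search_body(experiment_id, pending_only=False, max_results=1000):
    """Request body for MLflow's POST /api/2.0/mlflow/runs/search endpoint.

    The endpoint takes experiment ids, not names; resolve the name first.
    """
    body = {"experiment_ids": [experiment_id], "max_results": max_results}
    if pending_only:
        body["filter"] = "tags.judge_pending = 'true'"
    return body


def metrics_of(run):
    """Flatten one run from the search response into {metric_key: value}."""
    return {m["key"]: m["value"] for m in run.get("data", {}).get("metrics", [])}


def search_runs(base_url, experiment_id, **kwargs):
    """POST the search and return the list of runs (needs a reachable server)."""
    request = urllib.request.Request(
        f"{base_url}/api/2.0/mlflow/runs/search",
        data=json.dumps(build_search_body(experiment_id, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("runs", [])
```

Passing `pending_only=True` restricts the search to runs still tagged `judge_pending=true`, which is the set Phase B would need to export.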
Candidate tips, prompts, and raw responses are stored as tags with
keys artifact:candidates.json, artifact:prompt.txt, artifact:raw.txt
(tag fallback because the MLflow server uses a file:// artifact backend
not accessible via REST from the host).
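
Reading those tags back out of a run returned by the REST API might look like the sketch below. The helper names are hypothetical, and the assumption that the candidates.json tag holds JSON text is inferred from its name, not confirmed by the source.

```python
import json

ARTIFACT_PREFIX = "artifact:"  # key prefix described above


def artifact_tags(run):
    """Return {artifact_name: text} for tags such as 'artifact:prompt.txt'."""
    tags = run.get("data", {}).get("tags", [])
    return {tag["key"][len(ARTIFACT_PREFIX):]: tag["value"]
            for tag in tags if tag["key"].startswith(ARTIFACT_PREFIX)}


def load_candidates(run):
    """Parse the candidates tag, assuming (not confirmed) it holds JSON text."""
    raw = artifact_tags(run).get("candidates.json")
    return json.loads(raw) if raw is not None else None
```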
Integrating with Airflow (#95)
A future DAG ml/pipelines/prompt_ab_eval.py will wrap collect.py
exactly as shown in the quick-start, triggered on-demand from the admin
UI or manually. The results feed into the admin leaderboard view.
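
Whatever form the DAG takes, its task will need to assemble the same collect.py invocation as the quick start. A small, hypothetical helper for that is sketched here; the flags are the ones shown in the quick start, while the function itself is an assumption and not part of the repository.

```python
import shlex


def build_collect_cmd(models, prompts, experiment, n_tips=5, diversity=True):
    """Assemble the collect.py argv used in the quick start."""
    cmd = [
        "python", "ml/experiments/bench/collect.py",
        "--models", ",".join(models),
        "--prompts", ",".join(prompts),
        "--experiment", experiment,
        "--n-tips", str(n_tips),
    ]
    if diversity:
        cmd.append("--diversity")
    return cmd


cmd = build_collect_cmd(
    ["qwen2.5:0.5b", "qwen2.5:1.5b"], ["v1", "v3-few-shot"], "tip-bench-2026-04-27"
)
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

Returning an argv list rather than a shell string lets the caller pass it straight to `subprocess.run` without quoting concerns.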
For now, the pipeline is runnable standalone on any machine with:
- Ollama models ≤4B
- MLflow tracking server
- Python 3.10+