feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)

Combines model evaluation (#93) and prompt A/B testing (#95) into one
experiment. Evaluates all (model × prompt × scenario) cells on the same
fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell, logs to MLflow
  with judge_pending=true. Rejects models >4B, uses keep_alive=0 for RAM
  safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code
  to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone,
plus format_ok and overlong flags.
Composite = rel + act + tone + 2×format_ok − overlong.
Rubric is self-describing and persisted in every run so judges use
consistent criteria across sessions.

Artifacts (prompts, candidates, raw responses) stored as MLflow tags
because the server uses a file:// backend not accessible via REST. Full
artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75).

Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:48:59 +00:00
parent e40dfdcbb0
commit 556019b060
8 changed files with 1147 additions and 0 deletions
--- a/ml/experiments/bench/README.md
+++ b/ml/experiments/bench/README.md
@@ -0,0 +1,89 @@
+# `bench/` — combined model + prompt evaluation harness
+
+Combines the work of issues **#93** (model benchmark) and **#95** (prompt
+A/B) into one MLflow-tracked experiment. Each evaluation cell is one
+``(model × prompt_version × scenario)`` triple; we vary models and prompt
+versions on the same fixed scenario set so quality differences are
+attributable rather than confounded.
+
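The grid described above can be sketched as a plain Cartesian product. This is an illustration only, not the code in `collect.py`: the model and prompt names are taken from the commit message, while the scenario IDs and the `enumerate_cells` helper are hypothetical placeholders (the real contexts come from `scenarios.py`).

```python
from itertools import product

MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
PROMPTS = ["v1", "v2-mentor", "v3-few-shot"]
# Hypothetical placeholder IDs; the real scenario contexts live in scenarios.py.
SCENARIOS = [f"scenario-{i}" for i in range(1, 5)]


def enumerate_cells(models, prompts, scenarios):
    """Return every (model, prompt_version, scenario) triple exactly once."""
    return list(product(models, prompts, scenarios))


cells = enumerate_cells(MODELS, PROMPTS, SCENARIOS)
print(len(cells))  # 4 models * 3 prompts * 4 scenarios = 48
```

Because every model and prompt version sees the same scenario list, a score difference between two cells cannot come from a difference in input context.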
+## Pieces
+
+| File | Purpose |
+|------|---------|
+| `rubric.md`         | The scoring rubric (`tip-v1`). Anchor for the judge across sessions. |
+| `scenarios.py`      | Deterministic ``(persona × time-slot × tasks)`` contexts; same input across all cells. |
+| `mlflow_client.py`  | Thin httpx-based MLflow REST wrapper. Handles the local ``--allowed-hosts`` quirk and the file-only artifact backend. |
+| `collect.py`        | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
+| `judge_cli.py`      | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
+| `compare.py`        | **Phase C.** Leaderboard per ``(model, prompt)`` cell. |
+
+## RAM safety (#93 hard requirement)
+
+* Models larger than 4B parameters are **rejected up front** by
+  `collect.py --max-model-b 4.0`.
+* Calls to Ollama include ``keep_alive=0``, which unloads the model from
+  VRAM as soon as the response returns, so two sets of LLM weights are
+  never held concurrently.
+* No local judge model holds weights either: scoring is done in the
+  Claude Code session, at zero local RAM cost.
+
+The pipeline can run end to end on a box with 15 GiB RAM and 8 GiB VRAM
+(1070-class GPU) without paging.
+
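The two guards in this section might look roughly like the sketch below. It is written under assumptions: `param_count_b`, `check_model_size`, and `build_generate_payload` are hypothetical names, and reading the size from the tag suffix is only one possible way to apply the limit (the actual `collect.py` may do it differently). The payload fields follow Ollama's `/api/generate` request body, where `keep_alive: 0` asks the server to unload the model right after responding.

```python
import re


def param_count_b(model_tag: str) -> float:
    """Hypothetical helper: read the size in billions from a tag like 'qwen2.5:1.5b'."""
    match = re.search(r":(\d+(?:\.\d+)?)b$", model_tag.lower())
    if match is None:
        raise ValueError(f"cannot infer parameter count from {model_tag!r}")
    return float(match.group(1))


def check_model_size(model_tag: str, max_model_b: float = 4.0) -> float:
    """Raise before any generation starts if the model exceeds the size limit."""
    size = param_count_b(model_tag)
    if size > max_model_b:
        raise ValueError(f"{model_tag} is {size}B, above the {max_model_b}B limit")
    return size


def build_generate_payload(model_tag: str, prompt: str) -> dict:
    """Build a request body for Ollama's /api/generate (nothing is sent here)."""
    return {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,  # unload the weights as soon as the response returns
    }


check_model_size("llama3.2:3b")  # passes: 3.0 <= 4.0
payload = build_generate_payload("llama3.2:3b", "Suggest one tip.")
```

Checking the size before the first request means an oversized model is never loaded at all, rather than being loaded and then unloaded.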
+## Quick start
+
+```bash
+# 1. Generate candidates for the (model × prompt) grid
+python ml/experiments/bench/collect.py \
+    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
+    --prompts v1,v2-mentor,v3-few-shot \
+    --experiment tip-bench-2026-04-27 \
+    --n-tips 5 \
+    --diversity
+
+# 2. Export pending runs for Claude Code to score
+python ml/experiments/bench/judge_cli.py \
+    --experiment tip-bench-2026-04-27 \
+    --export /tmp/oo-bench-judge.json
+
+# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)
+
+# 4. Push scores back to MLflow
+python ml/experiments/bench/judge_cli.py \
+    --experiment tip-bench-2026-04-27 \
+    --apply /tmp/oo-bench-judge.json
+
+# 5. Leaderboard
+python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
+```
+
+## Why the rubric matters
+
+Different judging sessions need to be comparable. `rubric.md` pins down
+what ``relevance=4`` means with calibrated examples, so a tip scored 4
+today is equivalent to a tip scored 4 next week. Without the rubric, the
+"lazy human-in-the-loop" judge drifts.
+
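The commit message gives the `tip-v1` composite as rel + act + tone + 2×format_ok − overlong. A minimal sketch of that arithmetic follows; the function name and the range check are assumptions for illustration, not the harness's own code.

```python
def composite(relevance: int, actionability: int, tone: int,
              format_ok: bool, overlong: bool) -> int:
    """Composite = rel + act + tone + 2*format_ok - overlong (tip-v1)."""
    for name, value in (("relevance", relevance),
                        ("actionability", actionability),
                        ("tone", tone)):
        # The rubric scales run from 1 to 5; reject anything outside that range.
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    return relevance + actionability + tone + 2 * int(format_ok) - int(overlong)


best = composite(5, 5, 5, format_ok=True, overlong=False)
worst = composite(1, 1, 1, format_ok=False, overlong=True)
print(best, worst)
```

Under this formula a well-formatted tip earns a 2-point bonus and an overlong one loses 1 point, so the composite ranges from 2 to 17.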
+## Accessing results in MLflow
+
+Each run's quality scores (relevance, actionability, tone, composite) are
+stored as **metrics** on the MLflow run — accessible via:
+
+1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
+2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
+3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk
+
+Candidate tips, prompts, and raw responses are stored as **tags** with
+keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt`
+(tag fallback because the MLflow server uses a file:// artifact backend
+not accessible via REST from the host).
+
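Once the metrics are pulled, a per-cell leaderboard could be assembled along the lines below. This is a sketch under assumptions: the source does not say how `compare.py` aggregates scenarios, so averaging the composite per `(model, prompt)` cell is a guess, and the run dictionaries and the `leaderboard` function are hypothetical shapes rather than the MLflow response format.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Group runs by (model, prompt) and rank cells by mean composite (assumed aggregation)."""
    by_cell = defaultdict(list)
    for run in runs:
        by_cell[(run["model"], run["prompt"])].append(run["composite"])
    rows = [
        (model, prompt, mean(scores), len(scores))
        for (model, prompt), scores in by_cell.items()
    ]
    # Highest mean composite first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up scores for illustration only; these are not benchmark results.
example_runs = [
    {"model": "model-a", "prompt": "v1", "composite": 10},
    {"model": "model-a", "prompt": "v1", "composite": 12},
    {"model": "model-b", "prompt": "v1", "composite": 14},
    {"model": "model-b", "prompt": "v1", "composite": 9},
]

board = leaderboard(example_runs)
for model, prompt, score, n_runs in board:
    print(f"{model} x {prompt}: mean composite {score} over {n_runs} runs")
```

Keeping the run count next to the mean makes it visible when a cell has fewer judged scenarios than the others.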
+## Integrating with Airflow (#95)
+
+A future DAG `ml/pipelines/prompt_ab_eval.py` will wrap `collect.py`
+exactly as shown in the quick-start, triggered on-demand from the admin
+UI or manually. The results feed into the admin leaderboard view.
+
+For now, the pipeline is runnable standalone on any machine with:
+- Ollama serving models ≤4B
+- A reachable MLflow tracking server
+- Python 3.10+