feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)

Combines model evaluation (#93) and prompt A/B testing (#95) into one experiment. Evaluates all (model × prompt × scenario) cells on the same fixed contexts so quality differences are attributable. Architecture: - Phase A (collect.py): generates candidates per cell, logs to MLflow with judge_pending=true. Rejects models >4B, uses keep_alive=0 for RAM safety (no concurrent model weights in VRAM). - Phase B (judge_cli.py): exports pending runs as JSON for Claude Code to score per the rubric, then applies scores back to MLflow. - Phase C (compare.py): leaderboard by (model, prompt) cell. Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone, plus format_ok and overlong flags. Composite = rel + act + tone + 2×format_ok − overlong. Rubric is self-describing and persisted in every run so judges use consistent criteria across sessions. Artifacts (prompts, candidates, raw responses) stored as MLflow tags because the server uses a file:// backend not accessible via REST. Full artifacts accessible in MLflow UI → run → Tags section. Tested end-to-end on local machine: - 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B - 3 prompts (v1, v2-mentor, v3-few-shot) - 4 scenarios (4 personas × 2 time-slots) - 48 cells total, all judged and ranked Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75). Ready for integration into Airflow prompt_ab_eval DAG and admin UI. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:48:59 +00:00
parent e40dfdcbb0
commit 556019b060
8 changed files with 1147 additions and 0 deletions
--- a/ml/experiments/bench/rubric.md
+++ b/ml/experiments/bench/rubric.md
@@ -0,0 +1,85 @@
+# Tip-quality rubric — `tip-v1`
+
+This file is the consistency anchor for the Claude Code judge. The same
+rubric is used across every judging session so verdicts are comparable
+across runs (per the lazy-judge pattern in #95).
+
+Each candidate tip is scored on three independent 1–5 dimensions, plus
+two binary flags. Score the **content of the tip itself** for the given
+persona/context — do not score the rationale.
+
+## Dimensions
+
+### relevance — 1 to 5
+How well does the tip respond to *this specific persona at this specific
+time*? A generic productivity platitude is 1; a tip that hooks into the
+persona's stated preferences and the actual hour-of-day is 5.
+
+| score | description |
+|-------|-------------|
+| 1 | Boilerplate. Could apply to any user, any time. |
+| 2 | Vaguely fits the persona but ignores context. |
+| 3 | Fits the persona OR the time, not both. |
+| 4 | Fits both persona and time, with one specific anchor (a task, an hour, a habit). |
+| 5 | Specific to the persona's preferences AND respects the hour, with a clear hook into a candidate task or routine. |
+
+### actionability — 1 to 5
+Could the user *do this in the next 10 minutes* without further planning?
+"Try to focus more" is 1; "Spend 12 minutes on the Call dentist task and
+stop when the timer ends" is 5.
+
+| score | description |
+|-------|-------------|
+| 1 | Pure encouragement, no action. |
+| 2 | Action exists but vague ("review your tasks"). |
+| 3 | Concrete verb + object, but missing the time/duration handle. |
+| 4 | Concrete action with a duration or trigger ("for 10 minutes", "before lunch"). |
+| 5 | Micro-action with explicit start, duration, and a stop condition. |
+
+### tone — 1 to 5
+Does the tip sound like a calm, specific mentor (the product voice) or
+like a generic chatbot/coach? Penalize emoji-spam, exclamation marks,
+hype words ("amazing!", "let's crush it!"), and corporate jargon.
+
+| score | description |
+|-------|-------------|
+| 1 | Hype, jargon, or motivational-poster tone. |
+| 2 | Polite chatbot tone, no warmth. |
+| 3 | Neutral, businesslike. |
+| 4 | Quiet and specific, like a coach who knows you. |
+| 5 | Earned. Reads like a mentor who has seen this exact stuck-pattern before. |
+
+## Binary flags
+
+### format_ok — 0 or 1
+1 if the *whole response* parsed as a JSON array of objects with the
+required keys (`id`, `content`, `rationale`). 0 otherwise. **This is
+computed automatically by `collect.py`** — judges should not override it.
+
+### overlong — 0 or 1
+1 if `content` exceeds the documented 2-sentence cap (count sentence-
+ending punctuation `. ! ?`). Judges may flag this as a tiebreaker.
+
+## Composite score
+
+`compare.py` ranks cells by:
+
+```
+composite = relevance + actionability + tone + 2*format_ok - overlong
+```
+
+i.e. format compliance is a doubled weight (a malformed JSON is a hard
+production failure regardless of how good the prose is).
+
+## Calibration examples
+
+(Shared with judges so a 4 means the same thing across sessions.)
+
+**Persona**: deadline-driven (responds to overdue/high-priority,
+morning-active). **Hour**: 09:00. **Tasks include**: an overdue
+"Call dentist", priority 4.
+
+- "Stay focused and make today count!" — relevance 1, actionability 1, tone 1.
+- "Review your tasks and pick one that matters." — relevance 2, actionability 2, tone 3.
+- "Spend the next 12 minutes on Call dentist — set a timer and stop when it rings." — relevance 5, actionability 5, tone 4.
+- "It's 09:00 — you respond to overdue items best now. Block 12 minutes for Call dentist before your first meeting." — relevance 5, actionability 5, tone 5.