Files
oO/ml/experiments/bench/rubric.md
alvis 556019b060 feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)
Combines model evaluation (#93) and prompt A/B testing (#95) into one
experiment. Evaluates all (model × prompt × scenario) cells on the same
fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell, logs to MLflow
  with judge_pending=true. Rejects models >4B, uses keep_alive=0 for
  RAM safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code
  to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone,
plus format_ok and overlong flags. Composite = rel + act + tone +
2×format_ok − overlong. Rubric is self-describing and persisted in every
run so judges use consistent criteria across sessions.

Artifacts (prompts, candidates, raw responses) stored as MLflow tags
because the server uses a file:// backend not accessible via REST. Full
artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75).

Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:48:59 +00:00

3.5 KiB
Raw Blame History

Tip-quality rubric — tip-v1

This file is the consistency anchor for the Claude Code judge. The same rubric is used across every judging session so verdicts are comparable across runs (per the lazy-judge pattern in #95).

Each candidate tip is scored on three independent 15 dimensions, plus two binary flags. Score the content of the tip itself for the given persona/context — do not score the rationale.

Dimensions

relevance — 1 to 5

How well does the tip respond to this specific persona at this specific time? A generic productivity platitude is 1; a tip that hooks into the persona's stated preferences and the actual hour-of-day is 5.

score description
1 Boilerplate. Could apply to any user, any time.
2 Vaguely fits the persona but ignores context.
3 Fits the persona OR the time, not both.
4 Fits both persona and time, with one specific anchor (a task, an hour, a habit).
5 Specific to the persona's preferences AND respects the hour, with a clear hook into a candidate task or routine.

actionability — 1 to 5

Could the user do this in the next 10 minutes without further planning? "Try to focus more" is 1; "Spend 12 minutes on the Call dentist task and stop when the timer ends" is 5.

score description
1 Pure encouragement, no action.
2 Action exists but vague ("review your tasks").
3 Concrete verb + object, but missing the time/duration handle.
4 Concrete action with a duration or trigger ("for 10 minutes", "before lunch").
5 Micro-action with explicit start, duration, and a stop condition.

tone — 1 to 5

Does the tip sound like a calm, specific mentor (the product voice) or like a generic chatbot/coach? Penalize emoji-spam, exclamation marks, hype words ("amazing!", "let's crush it!"), and corporate jargon.

score description
1 Hype, jargon, or motivational-poster tone.
2 Polite chatbot tone, no warmth.
3 Neutral, businesslike.
4 Quiet and specific, like a coach who knows you.
5 Earned. Reads like a mentor who has seen this exact stuck-pattern before.

Binary flags

format_ok — 0 or 1

1 if the whole response parsed as a JSON array of objects with the required keys (id, content, rationale). 0 otherwise. This is computed automatically by collect.py — judges should not override it.

overlong — 0 or 1

1 if content exceeds the documented 2-sentence cap (count sentence- ending punctuation . ! ?). Judges may flag this as a tiebreaker.

Composite score

compare.py ranks cells by:

composite = relevance + actionability + tone + 2*format_ok - overlong

i.e. format compliance is a doubled weight (a malformed JSON is a hard production failure regardless of how good the prose is).

Calibration examples

(Shared with judges so a 4 means the same thing across sessions.)

Persona: deadline-driven (responds to overdue/high-priority, morning-active). Hour: 09:00. Tasks include: an overdue "Call dentist", priority 4.

  • "Stay focused and make today count!" — relevance 1, actionability 1, tone 1.
  • "Review your tasks and pick one that matters." — relevance 2, actionability 2, tone 3.
  • "Spend the next 12 minutes on Call dentist — set a timer and stop when it rings." — relevance 5, actionability 5, tone 4.
  • "It's 09:00 — you respond to overdue items best now. Block 12 minutes for Call dentist before your first meeting." — relevance 5, actionability 5, tone 5.