oO/ml/experiments/bench/rubric.md
alvis 556019b060 feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)
Combines model evaluation (#93) and prompt A/B testing (#95) into one
experiment. Evaluates all (model × prompt × scenario) cells on the same
fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell, logs to MLflow
  with judge_pending=true. Rejects models >4B, uses keep_alive=0 for
  RAM safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code
  to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone,
plus format_ok and overlong flags. Composite = rel + act + tone +
2×format_ok − overlong. Rubric is self-describing and persisted in every
run so judges use consistent criteria across sessions.

Artifacts (prompts, candidates, raw responses) stored as MLflow tags
because the server uses a file:// backend not accessible via REST. Full
artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75).

Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:48:59 +00:00

# Tip-quality rubric — `tip-v1`
This file is the consistency anchor for the Claude Code judge. The same
rubric is used across every judging session so verdicts are comparable
across runs (per the lazy-judge pattern in #95).
Each candidate tip is scored on three independent 1–5 dimensions, plus
two binary flags. Score the **content of the tip itself** for the given
persona/context — do not score the rationale.
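
A minimal sketch of how one judged candidate could be represented and range-checked before it is written back, assuming a plain Python container. `TipScore` is a hypothetical name used only for illustration; it is not taken from the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TipScore:
    """One judged candidate under tip-v1 (hypothetical container)."""

    relevance: int
    actionability: int
    tone: int
    format_ok: int
    overlong: int

    def __post_init__(self):
        # The three dimensions are integers on the 1 to 5 scale.
        for name in ("relevance", "actionability", "tone"):
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
        # The two flags are binary.
        for name in ("format_ok", "overlong"):
            value = getattr(self, name)
            if value not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1, got {value!r}")
```

Rejecting out-of-range values at construction time keeps a mistyped score (for example a 0 or a 6) from silently skewing a cell's ranking.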
## Dimensions
### relevance — 1 to 5
How well does the tip respond to *this specific persona at this specific
time*? A generic productivity platitude is 1; a tip that hooks into the
persona's stated preferences and the actual hour-of-day is 5.
| score | description |
|-------|-------------|
| 1 | Boilerplate. Could apply to any user, any time. |
| 2 | Vaguely fits the persona but ignores context. |
| 3 | Fits the persona OR the time, not both. |
| 4 | Fits both persona and time, with one specific anchor (a task, an hour, a habit). |
| 5 | Specific to the persona's preferences AND respects the hour, with a clear hook into a candidate task or routine. |
### actionability — 1 to 5
Could the user *do this in the next 10 minutes* without further planning?
"Try to focus more" is 1; "Spend 12 minutes on the Call dentist task and
stop when the timer ends" is 5.
| score | description |
|-------|-------------|
| 1 | Pure encouragement, no action. |
| 2 | Action exists but vague ("review your tasks"). |
| 3 | Concrete verb + object, but missing the time/duration handle. |
| 4 | Concrete action with a duration or trigger ("for 10 minutes", "before lunch"). |
| 5 | Micro-action with explicit start, duration, and a stop condition. |
### tone — 1 to 5
Does the tip sound like a calm, specific mentor (the product voice) or
like a generic chatbot/coach? Penalize emoji-spam, exclamation marks,
hype words ("amazing!", "let's crush it!"), and corporate jargon.
| score | description |
|-------|-------------|
| 1 | Hype, jargon, or motivational-poster tone. |
| 2 | Polite chatbot tone, no warmth. |
| 3 | Neutral, businesslike. |
| 4 | Quiet and specific, like a coach who knows you. |
| 5 | Earned. Reads like a mentor who has seen this exact stuck-pattern before. |
## Binary flags
### format_ok — 0 or 1
1 if the *whole response* parsed as a JSON array of objects with the
required keys (`id`, `content`, `rationale`). 0 otherwise. **This is
computed automatically by `collect.py`** — judges should not override it.
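
The check described above could look roughly like the sketch below. This is not the code in `collect.py`; the function name and the handling of an empty array are assumptions made for illustration.

```python
import json

REQUIRED_KEYS = {"id", "content", "rationale"}


def format_ok(raw_response: str) -> int:
    """Return 1 if the whole response is a JSON array of objects with the required keys."""
    try:
        parsed = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return 0
    if not isinstance(parsed, list):
        return 0
    # Assumption: an empty array passes, since the rubric does not say otherwise.
    for item in parsed:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            return 0
    return 1
```

Because the whole response must parse, a valid array wrapped in extra prose or a Markdown fence scores 0 under this sketch.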
### overlong — 0 or 1
1 if `content` exceeds the documented 2-sentence cap (count sentence-
ending punctuation `. ! ?`). Judges may flag this as a tiebreaker.
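
One way to apply the punctuation count, as a sketch only. The function name is hypothetical, and treating a run such as `...` or `?!` as a single sentence end is an assumption the rubric does not state.

```python
import re

SENTENCE_CAP = 2  # the documented 2-sentence cap


def overlong(content: str) -> int:
    """Return 1 if content has more sentence endings than the cap allows."""
    # Assumption: consecutive terminal marks ("...", "?!") count as one ending.
    endings = re.findall(r"[.!?]+", content)
    return 1 if len(endings) > SENTENCE_CAP else 0
```

A pure punctuation count also treats the period in a decimal such as "12.5" as a sentence end, which is one reason a judge's reading can differ from the count.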
## Composite score
`compare.py` ranks cells by:
```
composite = relevance + actionability + tone + 2*format_ok - overlong
```
That is, format compliance carries double weight: malformed JSON is a hard
production failure regardless of how good the prose is.
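
The formula above, written out as a small sketch. `composite` mirrors the formula directly; `cell_score` is a hypothetical helper, and averaging over a cell's runs is an assumption, since this file does not say how `compare.py` aggregates them.

```python
from statistics import mean


def composite(relevance: int, actionability: int, tone: int,
              format_ok: int, overlong: int) -> int:
    """Composite score for one judged candidate under tip-v1."""
    return relevance + actionability + tone + 2 * format_ok - overlong


def cell_score(composites: list) -> float:
    """Aggregate one (model, prompt) cell; the mean is an assumed aggregation."""
    return mean(composites)
```

With three 1 to 5 dimensions and two binary flags, the composite ranges from 2 (all 1s, malformed, overlong) to 17 (all 5s, well-formed, within the cap).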
## Calibration examples
(Shared with judges so a 4 means the same thing across sessions.)
**Persona**: deadline-driven (responds to overdue/high-priority,
morning-active). **Hour**: 09:00. **Tasks include**: an overdue
"Call dentist", priority 4.
- "Stay focused and make today count!" — relevance 1, actionability 1, tone 1.
- "Review your tasks and pick one that matters." — relevance 2, actionability 2, tone 3.
- "Spend the next 12 minutes on Call dentist — set a timer and stop when it rings." — relevance 5, actionability 5, tone 4.
- "It's 09:00 — you respond to overdue items best now. Block 12 minutes for Call dentist before your first meeting." — relevance 5, actionability 5, tone 5.