oO/ml/experiments/bench/rubric.md

# Tip-quality rubric — `tip-v1`

This file is the consistency anchor for the Claude Code judge. The same
rubric is used across every judging session so verdicts are comparable
across runs (per the lazy-judge pattern in #95).

Each candidate tip is scored on three independent 1–5 dimensions, plus
two binary flags. Score the **content of the tip itself** for the given
persona/context — do not score the rationale.

## Dimensions

### relevance — 1 to 5
How well does the tip respond to *this specific persona at this specific
time*? A generic productivity platitude is 1; a tip that hooks into the
persona's stated preferences and the actual hour-of-day is 5.

| score | description |
|-------|-------------|
| 1 | Boilerplate. Could apply to any user, any time. |
| 2 | Vaguely fits the persona but ignores context. |
| 3 | Fits the persona OR the time, not both. |
| 4 | Fits both persona and time, with one specific anchor (a task, an hour, a habit). |
| 5 | Specific to the persona's preferences AND respects the hour, with a clear hook into a candidate task or routine. |

### actionability — 1 to 5
Could the user *do this in the next 10 minutes* without further planning?
"Try to focus more" is 1; "Spend 12 minutes on the Call dentist task and
stop when the timer ends" is 5.

| score | description |
|-------|-------------|
| 1 | Pure encouragement, no action. |
| 2 | Action exists but vague ("review your tasks"). |
| 3 | Concrete verb + object, but missing the time/duration handle. |
| 4 | Concrete action with a duration or trigger ("for 10 minutes", "before lunch"). |
| 5 | Micro-action with explicit start, duration, and a stop condition. |

### tone — 1 to 5
Does the tip sound like a calm, specific mentor (the product voice) or
like a generic chatbot/coach? Penalize emoji-spam, exclamation marks,
hype words ("amazing!", "let's crush it!"), and corporate jargon.

| score | description |
|-------|-------------|
| 1 | Hype, jargon, or motivational-poster tone. |
| 2 | Polite chatbot tone, no warmth. |
| 3 | Neutral, businesslike. |
| 4 | Quiet and specific, like a coach who knows you. |
| 5 | Earned. Reads like a mentor who has seen this exact stuck-pattern before. |

## Binary flags

### format_ok — 0 or 1
1 if the *whole response* parsed as a JSON array of objects with the
required keys (`id`, `content`, `rationale`). 0 otherwise. **This is
computed automatically by `collect.py`** — judges should not override it.

### overlong — 0 or 1
1 if `content` exceeds the documented 2-sentence cap (count sentence-
ending punctuation `. ! ?`). Judges may flag this as a tiebreaker.

## Composite score

`compare.py` ranks cells by:

```
composite = relevance + actionability + tone + 2*format_ok - overlong
```

i.e. format compliance is a doubled weight (a malformed JSON is a hard
production failure regardless of how good the prose is).

## Calibration examples

(Shared with judges so a 4 means the same thing across sessions.)

**Persona**: deadline-driven (responds to overdue/high-priority,
morning-active). **Hour**: 09:00. **Tasks include**: an overdue
"Call dentist", priority 4.

- "Stay focused and make today count!" — relevance 1, actionability 1, tone 1.
- "Review your tasks and pick one that matters." — relevance 2, actionability 2, tone 3.
- "Spend the next 12 minutes on Call dentist — set a timer and stop when it rings." — relevance 5, actionability 5, tone 4.
- "It's 09:00 — you respond to overdue items best now. Block 12 minutes for Call dentist before your first meeting." — relevance 5, actionability 5, tone 5.