# Tip-quality rubric — `tip-v1` This file is the consistency anchor for the Claude Code judge. The same rubric is used across every judging session so verdicts are comparable across runs (per the lazy-judge pattern in #95). Each candidate tip is scored on three independent 1–5 dimensions, plus two binary flags. Score the **content of the tip itself** for the given persona/context — do not score the rationale. ## Dimensions ### relevance — 1 to 5 How well does the tip respond to *this specific persona at this specific time*? A generic productivity platitude is 1; a tip that hooks into the persona's stated preferences and the actual hour-of-day is 5. | score | description | |-------|-------------| | 1 | Boilerplate. Could apply to any user, any time. | | 2 | Vaguely fits the persona but ignores context. | | 3 | Fits the persona OR the time, not both. | | 4 | Fits both persona and time, with one specific anchor (a task, an hour, a habit). | | 5 | Specific to the persona's preferences AND respects the hour, with a clear hook into a candidate task or routine. | ### actionability — 1 to 5 Could the user *do this in the next 10 minutes* without further planning? "Try to focus more" is 1; "Spend 12 minutes on the Call dentist task and stop when the timer ends" is 5. | score | description | |-------|-------------| | 1 | Pure encouragement, no action. | | 2 | Action exists but vague ("review your tasks"). | | 3 | Concrete verb + object, but missing the time/duration handle. | | 4 | Concrete action with a duration or trigger ("for 10 minutes", "before lunch"). | | 5 | Micro-action with explicit start, duration, and a stop condition. | ### tone — 1 to 5 Does the tip sound like a calm, specific mentor (the product voice) or like a generic chatbot/coach? Penalize emoji-spam, exclamation marks, hype words ("amazing!", "let's crush it!"), and corporate jargon. | score | description | |-------|-------------| | 1 | Hype, jargon, or motivational-poster tone. | | 2 | Polite chatbot tone, no warmth. | | 3 | Neutral, businesslike. | | 4 | Quiet and specific, like a coach who knows you. | | 5 | Earned. Reads like a mentor who has seen this exact stuck-pattern before. | ## Binary flags ### format_ok — 0 or 1 1 if the *whole response* parsed as a JSON array of objects with the required keys (`id`, `content`, `rationale`). 0 otherwise. **This is computed automatically by `collect.py`** — judges should not override it. ### overlong — 0 or 1 1 if `content` exceeds the documented 2-sentence cap (count sentence- ending punctuation `. ! ?`). Judges may flag this as a tiebreaker. ## Composite score `compare.py` ranks cells by: ``` composite = relevance + actionability + tone + 2*format_ok - overlong ``` i.e. format compliance is a doubled weight (a malformed JSON is a hard production failure regardless of how good the prose is). ## Calibration examples (Shared with judges so a 4 means the same thing across sessions.) **Persona**: deadline-driven (responds to overdue/high-priority, morning-active). **Hour**: 09:00. **Tasks include**: an overdue "Call dentist", priority 4. - "Stay focused and make today count!" — relevance 1, actionability 1, tone 1. - "Review your tasks and pick one that matters." — relevance 2, actionability 2, tone 3. - "Spend the next 12 minutes on Call dentist — set a timer and stop when it rings." — relevance 5, actionability 5, tone 4. - "It's 09:00 — you respond to overdue items best now. Block 12 minutes for Call dentist before your first meeting." — relevance 5, actionability 5, tone 5.