feat: LLM tip quality monitoring dashboard in admin #92

alvis · 2026-04-17T08:12:37Z

Goal

Make prompt iteration data-driven. The admin can see which model + prompt version produces the best user reactions without running a formal A/B test.

Location

/admin/reward-analytics — extend with a new "LLM quality" section

Metrics to show

- Done rate / snooze rate / dismiss rate broken down by:
  - source (llm vs task_direct vs fallback)
  - model (qwen2.5:7b, llama3.2:3b, ...)
  - prompt_version (v1, v2, ...)
- Mean dwell time per model/version
- LLM parse failure rate (from source=llm_failed rows)
- Tip kind distribution (task / advice / insight / reminder) over time
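
A minimal sketch of how the rate metrics listed above could be computed from rows already loaded from tip_scores. The row shape (`TipScoreRow`), the field names `reaction` and `dwellMs`, and both denominators are assumptions made for illustration; the real columns come from #89 and may differ.

```typescript
// Hypothetical row shape; the real tip_scores columns may be named differently.
type Reaction = "done" | "snooze" | "dismiss";

interface TipScoreRow {
  source: string; // "llm", "task_direct", "fallback" or "llm_failed"
  model: string | null;
  promptVersion: string | null;
  reaction: Reaction | null; // null: no reaction recorded
  dwellMs: number | null; // null: dwell time not recorded
}

interface GroupStats {
  key: string;
  total: number;
  doneRate: number;
  snoozeRate: number;
  dismissRate: number;
  meanDwellMs: number | null;
}

// Group rows by an arbitrary key and compute reaction rates per group.
// Assumption: the denominator is every row in the group, reacted or not.
function summarize(
  rows: TipScoreRow[],
  keyOf: (row: TipScoreRow) => string
): GroupStats[] {
  const groups = new Map<string, TipScoreRow[]>();
  rows.forEach((row) => {
    const key = keyOf(row);
    const members = groups.get(key) ?? [];
    members.push(row);
    groups.set(key, members);
  });

  const stats: GroupStats[] = [];
  groups.forEach((members, key) => {
    const rate = (reaction: Reaction): number =>
      members.filter((m) => m.reaction === reaction).length / members.length;
    // Mean dwell time only uses rows where a dwell time was recorded.
    const dwell = members
      .map((m) => m.dwellMs)
      .filter((d): d is number => d !== null);
    stats.push({
      key,
      total: members.length,
      doneRate: rate("done"),
      snoozeRate: rate("snooze"),
      dismissRate: rate("dismiss"),
      meanDwellMs:
        dwell.length === 0
          ? null
          : dwell.reduce((sum, d) => sum + d, 0) / dwell.length,
    });
  });
  return stats;
}

// Assumption: failure rate = llm_failed rows / (llm + llm_failed rows).
function parseFailureRate(rows: TipScoreRow[]): number {
  const failed = rows.filter((r) => r.source === "llm_failed").length;
  const parsed = rows.filter((r) => r.source === "llm").length;
  const attempts = failed + parsed;
  return attempts === 0 ? 0 : failed / attempts;
}
```

The same `summarize` call can serve each breakdown by swapping the key function, for example `(r) => r.source` for the per-source view or a combined model and prompt_version key for the per-variant view.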

Implementation notes

- All data comes from the tip_scores table; this requires the #89 schema columns and #91 versioning
- Use Tremor BarList or GroupedBar for the multi-dimension breakdown
- Date range filter: same as on the existing reward analytics page
- Link from the /admin/experiments MLOps hub page
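
As a rough illustration of the notes above, the sketch below filters events to a date range and shapes a per-prompt-version done rate into `{ name, value }` items, which is the item shape Tremor's BarList takes in its `data` prop. The `TipEvent` type, its field names, and the half-open `[from, to)` range are assumptions; the existing reward analytics filter may treat its bounds differently.

```typescript
// Hypothetical event shape derived from tip_scores; field names are assumptions.
interface TipEvent {
  createdAt: string; // ISO 8601 timestamp
  promptVersion: string;
  reaction: "done" | "snooze" | "dismiss" | null;
}

// Item shape modelled on the BarList `data` prop: a label and a numeric value.
interface BarListItem {
  name: string;
  value: number;
}

// Keep events whose timestamp falls in [from, to).
function inDateRange(events: TipEvent[], from: Date, to: Date): TipEvent[] {
  return events.filter((e) => {
    const t = Date.parse(e.createdAt);
    return t >= from.getTime() && t < to.getTime();
  });
}

// Done rate per prompt version as a whole-number percentage, highest first.
function doneRateItems(events: TipEvent[]): BarListItem[] {
  const totals = new Map<string, { shown: number; done: number }>();
  events.forEach((e) => {
    const entry = totals.get(e.promptVersion) ?? { shown: 0, done: 0 };
    entry.shown += 1;
    if (e.reaction === "done") entry.done += 1;
    totals.set(e.promptVersion, entry);
  });

  const items: BarListItem[] = [];
  totals.forEach((entry, name) => {
    items.push({ name, value: Math.round((100 * entry.done) / entry.shown) });
  });
  return items.sort((a, b) => b.value - a.value);
}
```

In a React component the result could then be passed along as `<BarList data={doneRateItems(inDateRange(events, from, to))} />`, assuming the page already holds the selected range as two `Date` values.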

Why this matters

Without this dashboard, prompt changes are shipped blind. With it, you can ship a new prompt version to 10% of tips, watch this chart for 48h, and decide from the admin panel whether to keep it.
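
The issue does not say how the 10% of tips would be chosen. One possible approach, sketched here purely as an assumption, is to hash a stable tip identifier into 100 buckets so that the same tip always gets the same prompt version. The names `fnv1a`, `pickPromptVersion`, and `tipId` are hypothetical and not taken from the codebase.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units
// (identical to the byte-wise hash for ASCII identifiers).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Route roughly `rolloutPercent` percent of tips to the candidate prompt version.
// The choice is deterministic: the same tipId always maps to the same version.
function pickPromptVersion(
  tipId: string,
  candidate: string,
  control: string,
  rolloutPercent: number
): string {
  const bucket = fnv1a(tipId) % 100; // 0..99
  return bucket < rolloutPercent ? candidate : control;
}
```

Called as `pickPromptVersion(tipId, "v2", "v1", 10)`, this sends about one tip in ten to v2. Because the assignment is deterministic, the prompt_version stored with each tip stays consistent if the tip is regenerated.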

alvis added this to the M2 — AI tips + multi-source signals milestone 2026-04-17 08:12:37 +00:00

alvis referenced this issue from a commit

2026-04-24 15:24:58 +00:00

feat(admin): LLM tip quality dashboard — per-model/prompt/kind breakdowns

alvis closed this issue

2026-04-24 15:24:58 +00:00

alvis referenced this issue from a commit

2026-04-24 15:44:05 +00:00

feat(ml): prompt registry + per-request variant selection
