feat: ε-greedy v1 as active policy; dwell-time reward inference; offline sim framework

- Promote egreedy-v1 to active serving policy (ADR-0007): /score/egreedy + /reward/egreedy replaces linucb-v1 endpoints after offline sim shows +10.7% mean reward (−0.548 vs −0.606)
- Replace explicit helpful/not_helpful feedback with dwell-time inferred reward (inferReward): dismiss=−1.0, snooze=+0.1, done<15s=−0.3, done 15s–2min=+1.0, done 2–10min=+0.6, done>10min=+0.3
- Add ml/serving ε-greedy endpoints: /score/egreedy, /reward/egreedy, /stats/egreedy/{user_id} with d=7 feature vector (base 5 + sin/cos day-of-week encoding)
- Add offline simulation framework (ml/experiments/sim): rule/LLM/claude-code judges, two-phase score+reward, synthetic personas, task generator; results stored in sim_runs/sim_events
- Add /admin/simulations page: start runs, live-poll status, reward curve SVG, action/persona tables
- Fix egreedy day_of_week training skew: reward endpoint now uses actual dow instead of hardcoded 0
- Fix runner.py proxy bypass: httpx.Client(trust_env=False) for localhost ML calls
- Add dwellMs to TipFeedbackEvent contract and bus.test.ts fixture
- Schema: sim_runs, sim_events tables; tip_feedback gains dwell_ms, reward_milli columns
- ADR-0006: admin console framework; ADR-0007: egreedy-v1 policy selection rationale

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 07:44:37 +00:00
parent c5ea18ec6e
commit faf44c18fc
48 changed files with 6151 additions and 40 deletions
--- a/docs/adr/0006-admin-console-framework.md
+++ b/docs/adr/0006-admin-console-framework.md
@@ -0,0 +1,59 @@
+# ADR-0006: Admin console framework — Next.js 15 + Tremor + shadcn/ui + embed specialist tools
+
+## Status
+Accepted — 2026-04-15
+
+## Context
+M1 ships a bandit-driven recommender, an event bus, and a live feedback loop. Without a cockpit to observe these systems, every model change ships blind. An admin console is needed to:
+
+1. **Observe** — DAU/WAU, tip outcomes, reaction rates, LinUCB arm stats, feature distributions
+2. **Inspect** — per-user identity, consents, integrations, reward history
+3. **Act** — revoke tokens, replay signals, reset a per-user bandit, promote a policy
+4. **Audit** — every operator action is logged
+
+The team is two people. The stack is TypeScript/React/Tailwind. Any framework that forks the stack creates a context-switch tax and a second deployment surface.
+
+## Decision
+
+### App shell — `apps/admin`, Next.js 15, App Router
+
+Same stack as `apps/web`. Reuses `packages/shared-types`, the Auth.js session cookie, and the API rewrite convention. Deployed at `admin.o.alogins.net` behind Caddy, port 3080 in dev.
+
+### UI libraries
+
+| Layer | Library | Reason |
+|-------|---------|--------|
+| Charts / KPI | **Tremor** | Analytics-first React + Tailwind components (KPI cards, time-series, bar lists). Designed for dashboards, not bolted on. |
+| CRUD primitives | **shadcn/ui** | Copy-paste Radix components; forms, dialogs, command palette. No version lock-in — code lives in-repo. |
+| Heavy grids | **TanStack Table v8** | Sortable / paginated / virtualized tables for events, users, tips. |
+| Extra charts | **Recharts** | Fallback where Tremor falls short (histograms, distributions). |
+
+### Embed, don't rebuild
+
+Specialized tooling is **reverse-proxied into the admin shell**, not reimplemented:
+
+- **MLflow UI** → `/admin/models` (Caddy sub-path proxy)
+- **Grafana panels** → `/admin/infra` (iframed or embedded panels)
+- **Marimo notebooks** → launch-out link from admin
+
+This avoids reimplementing artifact browsers and graph renderers that we could not build as well ourselves.
+
+### AuthZ
+
+`profile.role` column on the `users` table (values: `'user'` | `'admin'`). First admin seeded via `ADMIN_SEED_EMAIL` env var at startup. Admin-only gate in Next.js middleware checks the session and the role returned by `GET /api/user/me`. Every write action through the admin API is appended to an `admin_actions` audit log.
+
+### Rejected alternatives
+
+| Option | Rejected because |
+|--------|-----------------|
+| Retool / AppSmith | Admin logic leaves the repo; weak analytics affordances |
+| Streamlit / Gradio | Python-first; splits the frontend stack; thin RBAC |
+| React-admin / Refine.dev | Strong CRUD scaffolding, but analytics views feel bolted on |
+| Superset / Metabase as the admin surface | Excellent BI, poor operational writes; plan: adopt Superset in M4 for BI alongside batch pipelines |
+
+## Consequences
+
+- One more Next.js app in the monorepo. Build/dev added to Turborepo.
+- Tremor + shadcn/ui are added as dependencies. shadcn components are copied into `apps/admin/src/components/ui/` — no runtime version coupling.
+- MLflow and Grafana must be reachable from the Caddy reverse proxy; they are not embedded in the JS bundle.
+- `admin_actions` audit log grows unboundedly — needs a retention policy before M4.
--- a/docs/adr/0007-egreedy-v1-active-policy.md
+++ b/docs/adr/0007-egreedy-v1-active-policy.md
@@ -0,0 +1,47 @@
+# ADR-0007: ε-greedy v1 as the active recommendation policy
+
+## Status
+Accepted — 2026-04-16
+
+## Context
+
+M1 shipped LinUCB (d=5, α=1.0) as the first learned policy via `ml/serving /score`. After the M1 admin console landed, we ran an offline simulation to compare LinUCB against a new ε-greedy ridge-regression policy before deciding which to keep live.
+
+**ε-greedy v1 design:**
+- Ridge regression estimator, θ updated online (equivalent to LinUCB without the UCB bonus).
+- d=7 feature vector: base 5 (is\_overdue, task\_age\_days, priority, hour\_of\_day, bias) + sin/cos encoding of day\_of\_week.
+- Random exploration with probability ε=0.10; otherwise (90% of pulls) argmax(θ·x).
+- Separate per-user state files (`{user}_egreedy.json`), independent of LinUCB state.
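The design bullets above can be sketched in a few lines of pure Python. This is an illustrative annotation, not the code in `ml/serving`: the names `features` and `EpsilonGreedyRidge`, the feature order, the absence of feature scaling, the ridge strength of 1.0, and the Sherman-Morrison update of the inverse are all assumptions made for the example.

```python
import math
import random
from typing import List, Optional, Sequence

D = 7  # base 5 features + sin/cos of day_of_week


def features(is_overdue: bool, task_age_days: float, priority: float,
             hour_of_day: int, day_of_week: int) -> List[float]:
    """Build the d=7 vector; order and raw (unscaled) values are assumptions."""
    angle = 2.0 * math.pi * day_of_week / 7.0  # cyclic encoding, 0..6
    return [float(is_overdue), float(task_age_days), float(priority),
            float(hour_of_day), 1.0,  # 1.0 is the bias term
            math.sin(angle), math.cos(angle)]


class EpsilonGreedyRidge:
    """Hypothetical epsilon-greedy policy over an online ridge estimator."""

    def __init__(self, d: int = D, epsilon: float = 0.10,
                 ridge: float = 1.0, seed: Optional[int] = None) -> None:
        self.epsilon = epsilon
        # Inverse of A = ridge * I, kept up to date incrementally.
        self.a_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(d)]
                      for i in range(d)]
        self.b = [0.0] * d
        self.rng = random.Random(seed)

    def theta(self) -> List[float]:
        """theta = A^-1 b, the current ridge-regression weights."""
        return [sum(a * bj for a, bj in zip(row, self.b)) for row in self.a_inv]

    def choose(self, candidates: Sequence[Sequence[float]]) -> int:
        """Return the index of the chosen candidate feature vector."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(candidates))  # explore
        th = self.theta()
        scores = [sum(t * x for t, x in zip(th, c)) for c in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)  # exploit

    def update(self, x: Sequence[float], reward: float) -> None:
        """Rank-one update: A += x x^T (via Sherman-Morrison), b += reward * x."""
        ax = [sum(a * xj for a, xj in zip(row, x)) for row in self.a_inv]
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, ax))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]
```

Dropping the exploration branch and adding an uncertainty bonus to each score would turn this into a LinUCB-style policy, which is why the two policies can share the same ridge estimator.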
+
+**Simulation setup (rule judge, seed=42):**
+- 5 synthetic personas × 20 rounds = 100 judgments per policy, with 8 candidate tasks per round.
+- Reward inferred from dwell-time (same `inferReward` logic as production): dismiss=−1, snooze=+0.1, done<15 s=−0.3, done 15 s–2 min=+1.0, done 2–10 min=+0.6, done>10 min=+0.3.
+- Both policies started from blank state (no warm-up).
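The dwell-time mapping in the setup above can be written as a small lookup. The sketch below is a Python rendering for illustration only: the function names, the action strings, and the boundary handling (15 s and 2 min fall into the higher-numbered bucket, exactly 10 min stays in the 2–10 min bucket) are assumptions, since the ADR does not say which side of each boundary is inclusive. `to_reward_milli` assumes the `reward_milli` column stores the reward multiplied by 1000.

```python
from typing import Optional


def infer_reward(action: str, dwell_ms: Optional[int] = None) -> float:
    """Map a tip outcome to a scalar reward using the dwell-time table."""
    if action == "dismiss":
        return -1.0
    if action == "snooze":
        return 0.1
    if action != "done":
        raise ValueError(f"unknown action: {action!r}")
    if dwell_ms is None:
        raise ValueError("dwell_ms is required when action is 'done'")

    seconds = dwell_ms / 1000.0
    if seconds < 15:       # done almost immediately
        return -0.3
    if seconds < 120:      # 15 s to 2 min: highest reward
        return 1.0
    if seconds <= 600:     # 2 to 10 min
        return 0.6
    return 0.3             # more than 10 min


def to_reward_milli(reward: float) -> int:
    """Integer form of the reward (assumed meaning of reward_milli)."""
    return round(reward * 1000)
```

For example, `infer_reward("done", 45_000)` falls in the 15 s to 2 min bucket, while `infer_reward("dismiss")` ignores dwell time entirely.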
+
+**Results:**
+
+| Policy | Total reward | Mean reward/pull | Pulls |
+|--------|-------------|-----------------|-------|
+| egreedy-v1 | −54.80 | −0.548 | 100 |
+| linucb-v1  | −60.60 | −0.606 | 100 |
+
+Winner: **egreedy-v1** (+10.7% mean reward).
+
+Both policies produce negative mean rewards under the dwell-time model. This is expected: on cold models, most simulated users don't complete tasks in the 15 s–2 min window that earns the top reward. The gap widens from round 8 onward, consistent with LinUCB's UCB exploration bonus over-favouring high-uncertainty dimensions (is\_overdue, task\_age\_days) regardless of persona fit.
+
+## Decision
+
+Promote **egreedy-v1** to the active serving policy:
+- `POST /recommend` calls `/score/egreedy` instead of `/score`.
+- Feedback loop calls `/reward/egreedy`.
+- LinUCB (`/score`, `/reward`) remains deployed in `ml/serving` as a shadow-eligible fallback.
+
+The simulation does not replace online A/B testing; it is evidence that egreedy-v1 is worth promoting before collecting real-user signal. A future milestone will run live A/B once we have enough daily active users for statistical power.
+
+## Consequences
+
+- Recommendation calls and reward updates now hit the egreedy endpoints only.
+- LinUCB state is preserved on disk; re-activation is a one-line change.
+- `tip_scores.policy` will log `egreedy-v1` for new serves; historical rows remain `linucb-v1` or `random`.
+- The dwell-time reward model (`inferReward`) is now the canonical feedback signal for both online updates and simulation. Explicit helpful/not\_helpful signals are removed.
+- Next evaluation gate: once ≥500 real tips have been served with egreedy-v1, compare the reward distribution to the LinUCB historical baseline in the admin Reward Analytics page before deciding on the next policy iteration.