feat: ε-greedy v1 as active policy; dwell-time reward inference; offline sim framework

- Promote egreedy-v1 to active serving policy (ADR-0007): /score/egreedy + /reward/egreedy replaces linucb-v1 endpoints after offline sim shows +10.7% mean reward (−0.548 vs −0.606)
- Replace explicit helpful/not_helpful feedback with dwell-time inferred reward (inferReward): dismiss=−1.0, snooze=+0.1, done<15s=−0.3, done 15s–2min=+1.0, done 2–10min=+0.6, done>10min=+0.3
- Add ml/serving ε-greedy endpoints: /score/egreedy, /reward/egreedy, /stats/egreedy/{user_id} with d=7 feature vector (base 5 + sin/cos day-of-week encoding)
- Add offline simulation framework (ml/experiments/sim): rule/LLM/claude-code judges, two-phase score+reward, synthetic personas, task generator; results stored in sim_runs/sim_events
- Add /admin/simulations page: start runs, live-poll status, reward curve SVG, action/persona tables
- Fix egreedy day_of_week training skew: reward endpoint now uses actual dow instead of hardcoded 0
- Fix runner.py proxy bypass: httpx.Client(trust_env=False) for localhost ML calls
- Add dwellMs to TipFeedbackEvent contract and bus.test.ts fixture
- Schema: sim_runs, sim_events tables; tip_feedback gains dwell_ms, reward_milli columns
- ADR-0006: admin console framework; ADR-0007: egreedy-v1 policy selection rationale

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
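
The dwell-time reward mapping listed in the commit message can be sketched roughly as follows. This is an illustration only, not the committed `inferReward`: the `FeedbackAction` type, the parameter names, and the handling of dwell times that fall exactly on the 2 min and 10 min boundaries are assumptions.

```typescript
// Hypothetical sketch of the dwell-time reward mapping; not the committed inferReward.
type FeedbackAction = "done" | "snooze" | "dismiss";

function inferReward(action: FeedbackAction, dwellMs: number): number {
  // dismiss and snooze carry a fixed reward regardless of dwell time.
  if (action === "dismiss") return -1.0;
  if (action === "snooze") return 0.1;

  // action === "done": the reward depends on how long the tip was shown.
  const dwellSec = dwellMs / 1000;
  if (dwellSec < 15) return -0.3;  // done in under 15 s
  if (dwellSec < 120) return 1.0;  // done in 15 s to 2 min
  if (dwellSec <= 600) return 0.6; // done in 2 to 10 min (boundary handling assumed)
  return 0.3;                      // done after more than 10 min
}
```

For example, a `done` reaction after 60 000 ms falls in the 15 s to 2 min band and maps to +1.0, while the same reaction after 5 000 ms maps to −0.3.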
2026-04-16 07:44:37 +00:00
parent c5ea18ec6e
commit faf44c18fc
48 changed files with 6151 additions and 40 deletions
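
The d=7 feature vector named in the commit message (base 5 plus sin/cos day-of-week encoding) might be assembled along these lines. This is a sketch under assumptions: the interface, the field names, the feature ordering, and the choice of the five base features (taken from the README's feature list) are illustrative, and any scaling applied in `ml/serving` is not shown.

```typescript
// Hypothetical sketch; names, ordering, and the base-feature set are assumptions.
interface BaseFeatures {
  isOverdue: number;   // 0 or 1
  taskAgeDays: number;
  priority: number;
  hourOfDay: number;   // 0..23
  dayOfWeek: number;   // 0..6
}

// Map day-of-week onto the unit circle so that day 6 and day 0 are neighbours.
function encodeDayOfWeek(dow: number): [number, number] {
  const angle = (2 * Math.PI * dow) / 7;
  return [Math.sin(angle), Math.cos(angle)];
}

// Five base features followed by the two cyclical day-of-week features.
function buildFeatureVector(f: BaseFeatures): number[] {
  const [dowSin, dowCos] = encodeDayOfWeek(f.dayOfWeek);
  return [
    f.isOverdue,
    f.taskAgeDays,
    f.priority,
    f.hourOfDay,
    f.dayOfWeek,
    dowSin,
    dowCos,
  ];
}
```

Under this encoding a day-of-week fixed at 0 always yields the pair (0, 1), which suggests why the hardcoded value skewed training: reward updates would never see any variation in the cyclical features.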
--- a/README.md
+++ b/README.md
@@ -76,7 +76,7 @@ Goal: a single user signs in with Google, connects Todoist, and sees one random
 - [x] `integrations/todoist` — OAuth2 flow, token stored in DB, disconnect supported
 - [x] `recommender` with `RandomPolicy`; stable `POST /recommend` contract; 30s task cache
 - [x] `apps/web` — sign-in, connect, tip pages; PWA manifest + icons
-- [x] Feedback endpoint (done/dismiss/snooze); marks task complete in Todoist
+- [x] Feedback: `done / snooze / dismiss`; reward inferred from dwell-time (`inferReward`); marks task complete in Todoist
 - [x] Deploy modular monolith to Agap VM via Caddy at `o.alogins.net`
 - [x] ToS + Privacy Policy pages (`/legal/terms`, `/legal/privacy`); implicit consent on sign-in
 - [x] Account deletion: revokes tokens, purges data, soft-deletes profile; button on /connect
@@ -87,10 +87,11 @@ Goal: tips are picked, not drawn from a hat — and they arrive at the right mom
 - [x] Event bus scaffold: typed in-process EventEmitter with 500-event ring buffer; subjects match future NATS JetStream — swap is mechanical
 - [x] Todoist sync emits `signals.task.synced`; tip served/feedback emit `signals.tip.*`
 - [x] Features extracted per task: `is_overdue`, `task_age_days`, `priority`; context: `hour_of_day`, `day_of_week`
-- [x] `ml/serving` LinUCB bandit (d=5, alpha=1.0); per-user state persisted to disk; `/score` + `/reward` + `/reset` + `/stats` + `/features` endpoints
+- [x] `ml/serving` LinUCB (d=5) + **ε-greedy v1** (d=7, ε=0.10, day-of-week sin/cos features); per-user state persisted to disk
 - [x] `RemotePolicy` in recommender: calls ml/serving, falls back to RandomPolicy on timeout/error; logs explainability to `tip_scores`
-- [x] Feedback loop: reactions mapped to rewards (done=+1, helpful=+0.5, snooze=0, not_helpful=−0.5, dismiss=−1) → online LinUCB update
-- [x] In-app **helpful / not helpful** coarse signal (#62) — long-press action sheet on tip page
+- [x] Feedback loop: dwell-time inferred reward (`inferReward`) → online model update; `done` in 15 s–2 min = +1.0 (magic zone)
+- [x] Offline simulation framework (`ml/experiments/sim`): rule/LLM/claude-code judges, two-policy comparison, results persisted to `sim_runs` + `sim_events`
+- [x] **ε-greedy v1 promoted to active policy** (ADR-0007) — +10.7% mean reward vs LinUCB in offline sim
 - [x] **Web Push** (VAPID): SW, subscribe/unsubscribe API, "notify me" button on tip page
 - [x] Shadow-policy registry: run N shadow policies per request, log picks without serving them (#56)
 - [ ] Quiet-hours + dedupe for push delivery