oO/docs/adr/0007-egreedy-v1-active-policy.md
alvis faf44c18fc feat: ε-greedy v1 as active policy; dwell-time reward inference; offline sim framework
- Promote egreedy-v1 to active serving policy (ADR-0007): /score/egreedy + /reward/egreedy
  replaces linucb-v1 endpoints after offline sim shows +10.7% mean reward (−0.548 vs −0.606)
- Replace explicit helpful/not_helpful feedback with dwell-time inferred reward (inferReward):
  dismiss=−1.0, snooze=+0.1, done<15s=−0.3, done 15s–2min=+1.0, done 2–10min=+0.6, done>10min=+0.3
- Add ml/serving ε-greedy endpoints: /score/egreedy, /reward/egreedy, /stats/egreedy/{user_id}
  with d=7 feature vector (base 5 + sin/cos day-of-week encoding)
- Add offline simulation framework (ml/experiments/sim): rule/LLM/claude-code judges,
  two-phase score+reward, synthetic personas, task generator; results stored in sim_runs/sim_events
- Add /admin/simulations page: start runs, live-poll status, reward curve SVG, action/persona tables
- Fix egreedy day_of_week training skew: reward endpoint now uses actual dow instead of hardcoded 0
- Fix runner.py proxy bypass: httpx.Client(trust_env=False) for localhost ML calls
- Add dwellMs to TipFeedbackEvent contract and bus.test.ts fixture
- Schema: sim_runs, sim_events tables; tip_feedback gains dwell_ms, reward_milli columns
- ADR-0006: admin console framework; ADR-0007: egreedy-v1 policy selection rationale

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 07:44:37 +00:00

# ADR-0007: ε-greedy v1 as the active recommendation policy
## Status
Accepted — 2026-04-16
## Context
M1 shipped LinUCB (d=5, α=1.0) as the first learned policy via `ml/serving /score`. After the M1 admin console landed, we ran an offline simulation to compare LinUCB against a new ε-greedy ridge-regression policy before deciding which to keep live.
**ε-greedy v1 design:**
- Ridge regression estimator, θ updated online (equivalent to LinUCB without the UCB bonus).
- d=7 feature vector: base 5 (is\_overdue, task\_age\_days, priority, hour\_of\_day, bias) + sin/cos encoding of day\_of\_week.
- ε=0.10: with probability 0.10 pick a random task (exploration); otherwise, 90% of the time, pick argmax(θ·x).
- Separate per-user state files (`{user}_egreedy.json`), independent of LinUCB state.
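
The design above can be illustrated with a minimal, dependency-free sketch. It is not the `ml/serving` implementation: the names `features` and `EGreedyRidge`, the ridge prior `lam=1.0`, the unscaled feature values, and the Sherman-Morrison update are all assumptions made for illustration.

```python
import math
import random


def features(is_overdue, task_age_days, priority, hour_of_day, day_of_week):
    """Build a d=7 vector: base 5 plus sin/cos of day_of_week (0..6).

    Feature order and the absence of scaling are assumptions.
    """
    angle = 2.0 * math.pi * day_of_week / 7.0
    return [
        1.0 if is_overdue else 0.0,
        float(task_age_days),
        float(priority),
        float(hour_of_day),
        1.0,  # bias term
        math.sin(angle),
        math.cos(angle),
    ]


class EGreedyRidge:
    """Hypothetical epsilon-greedy policy over an online ridge estimator."""

    def __init__(self, d=7, epsilon=0.10, lam=1.0, rng=None):
        self.d = d
        self.epsilon = epsilon
        # A starts as lam * I, so its inverse is I / lam.
        self.a_inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)]
                      for i in range(d)]
        self.b = [0.0] * d
        self.rng = rng or random.Random()

    def theta(self):
        # theta = A^{-1} b
        return [sum(row[j] * self.b[j] for j in range(self.d))
                for row in self.a_inv]

    def choose(self, candidates):
        """Return the index of the chosen candidate feature vector."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(candidates))  # explore
        th = self.theta()
        scores = [sum(t * x for t, x in zip(th, xs)) for xs in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)  # exploit

    def update(self, x, reward):
        """Rank-one update of A^{-1} (Sherman-Morrison), then b += reward * x."""
        ax = [sum(row[j] * x[j] for j in range(self.d)) for row in self.a_inv]
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, ax))
        for i in range(self.d):
            for j in range(self.d):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(self.d):
            self.b[i] += reward * x[i]
```

Unlike LinUCB, `choose` scores candidates by θ·x alone; exploration comes only from the ε branch.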
**Simulation setup (rule judge, seed=42):**
- 5 synthetic personas × 20 rounds = 100 judgments (pulls) per policy; each round offers 8 tasks.
- Reward inferred from dwell-time (same `inferReward` logic as production): dismiss=−1.0, snooze=+0.1, done<15 s=−0.3, done 15 s–2 min=+1.0, done 2–10 min=+0.6, done>10 min=+0.3.
- Both policies started from blank state (no warm-up).
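
The dwell-time reward mapping used in the setup can be sketched as follows. This is a Python rendering for illustration, not the production `inferReward`; the function name `infer_reward`, the millisecond argument, and the handling of the exact boundaries (15 s, 2 min, 10 min) are assumptions.

```python
def infer_reward(action, dwell_ms=0):
    """Map a feedback action and dwell time to a reward.

    Boundary inclusivity is an assumption; the source only gives the ranges.
    """
    if action == "dismiss":
        return -1.0
    if action == "snooze":
        return 0.1
    if action == "done":
        seconds = dwell_ms / 1000.0
        if seconds < 15:
            return -0.3  # completed too quickly to be a real completion
        if seconds <= 120:
            return 1.0   # 15 s to 2 min
        if seconds <= 600:
            return 0.6   # 2 to 10 min
        return 0.3       # more than 10 min
    raise ValueError(f"unknown action: {action!r}")
```

For example, `infer_reward("done", 60_000)` yields the maximum reward of 1.0, while `infer_reward("done", 10_000)` is penalised with −0.3.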
**Results:**
| Policy | Total reward | Mean reward/pull | Pulls |
|--------|-------------|-----------------|-------|
| egreedy-v1 | −54.80 | −0.548 | 100 |
| linucb-v1 | −60.60 | −0.606 | 100 |
Winner: **egreedy-v1** (+10.7% mean reward).
Both policies produce negative mean rewards under the dwell-time model. This is expected: on cold models, most simulated users do not complete tasks in the 15 s–2 min window that earns the maximum reward. The gap widens from round 8 onward, consistent with LinUCB's UCB exploration bonus over-favouring high-uncertainty dimensions (is\_overdue, task\_age\_days) regardless of persona fit.
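
To make the exploration-bonus explanation concrete, here is a minimal sketch of the standard LinUCB score, θ·x + α·sqrt(xᵀA⁻¹x), for the d=5 LinUCB features. The function names and the example vectors are hypothetical, and the unscaled feature values are an assumption; it only illustrates how a large raw value such as task\_age\_days can inflate the bonus on a blank model.

```python
import math


def ucb_bonus(a_inv, x, alpha=1.0):
    """Exploration bonus alpha * sqrt(x^T A^{-1} x)."""
    ax = [sum(row[j] * x[j] for j in range(len(x))) for row in a_inv]
    return alpha * math.sqrt(sum(xi * axi for xi, axi in zip(x, ax)))


def linucb_score(theta, a_inv, x, alpha=1.0):
    """Estimated reward plus the UCB bonus."""
    return sum(t * xi for t, xi in zip(theta, x)) + ucb_bonus(a_inv, x, alpha)


# Blank state: A^{-1} is the identity and theta is zero.
d = 5
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
theta0 = [0.0] * d

# (is_overdue, task_age_days, priority, hour_of_day, bias); values are made up.
fresh_task = [0.0, 1.0, 2.0, 9.0, 1.0]
stale_overdue_task = [1.0, 30.0, 2.0, 9.0, 1.0]

# With no data, the score is the bonus alone, so the vector with the
# larger norm (the stale, overdue task) wins regardless of persona.
fresh_score = linucb_score(theta0, identity, fresh_task)
stale_score = linucb_score(theta0, identity, stale_overdue_task)
```

An ε-greedy policy in the same blank state would score both tasks at 0 and has no such bias toward large-magnitude features.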
## Decision
Promote **egreedy-v1** to the active serving policy:
- `POST /recommend` calls `/score/egreedy` instead of `/score`.
- Feedback loop calls `/reward/egreedy`.
- LinUCB (`/score`, `/reward`) remains deployed in `ml/serving` as a shadow-eligible fallback.
The simulation does not replace online A/B testing; it is evidence that egreedy-v1 is worth promoting before collecting real-user signal. A future milestone will run live A/B once we have enough daily active users for statistical power.
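
One way to picture the routing in this decision is a small policy-to-endpoint table. The endpoint paths and policy labels come from this ADR; the names `POLICY_ENDPOINTS`, `ACTIVE_POLICY`, and `endpoint` are hypothetical and do not describe the actual `/recommend` handler.

```python
# Endpoint paths per policy, as listed in this ADR.
POLICY_ENDPOINTS = {
    "egreedy-v1": {"score": "/score/egreedy", "reward": "/reward/egreedy"},
    "linucb-v1": {"score": "/score", "reward": "/reward"},
}

# Switching back to LinUCB would mean changing only this value.
ACTIVE_POLICY = "egreedy-v1"


def endpoint(kind, policy=None):
    """Return the ml/serving path for 'score' or 'reward' calls."""
    return POLICY_ENDPOINTS[policy or ACTIVE_POLICY][kind]
```

Keeping both entries in the table mirrors the intent that LinUCB stays deployed as a fallback while only egreedy-v1 receives live traffic.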
## Consequences
- Recommendation calls and reward updates now hit the egreedy endpoints only.
- LinUCB state is preserved on disk; re-activation is a one-line change.
- `tip_scores.policy` will log `egreedy-v1` for new serves; historical rows remain `linucb-v1` or `random`.
- The dwell-time reward model (`inferReward`) is now the canonical feedback signal for both online updates and simulation. Explicit helpful/not\_helpful signals are removed.
- Next evaluation gate: once ≥500 real tips have been served with egreedy-v1, compare the reward distribution to the LinUCB historical baseline in the admin Reward Analytics page before deciding on the next policy iteration.
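
The evaluation gate could be checked with something as simple as the sketch below. The ADR does not specify the comparison method, so comparing means is an assumption, as are the names `MIN_SERVES` and `compare_policies` and the plain lists of reward values as input.

```python
from statistics import mean

MIN_SERVES = 500  # gate from this ADR: at least 500 real tips served


def compare_policies(egreedy_rewards, linucb_rewards, min_serves=MIN_SERVES):
    """Return a mean-reward comparison, or None if the gate is not yet met."""
    if len(egreedy_rewards) < min_serves or not linucb_rewards:
        return None
    egreedy_mean = mean(egreedy_rewards)
    linucb_mean = mean(linucb_rewards)
    return {
        "egreedy_mean": egreedy_mean,
        "linucb_mean": linucb_mean,
        "delta": egreedy_mean - linucb_mean,  # positive favours egreedy-v1
    }
```

A fuller comparison would look at the whole reward distribution rather than the mean alone.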