# ADR-0007: ε-greedy v1 as the active recommendation policy

## Status

Accepted (2026-04-16)

## Context

M1 shipped LinUCB (d=5, α=1.0) as the first learned policy, served from the `/score` endpoint in `ml/serving`. After the M1 admin console landed, we ran an offline simulation to compare LinUCB against a new ε-greedy ridge-regression policy before deciding which to keep live.

**ε-greedy v1 design:**

- Ridge regression estimator, θ updated online (equivalent to LinUCB without the UCB bonus).
- d=7 feature vector: the base 5 (`is_overdue`, `task_age_days`, `priority`, `hour_of_day`, bias) plus a sin/cos encoding of `day_of_week`.
- ε=0.10: a random task is chosen 10% of the time; the other 90% picks argmax(θ·x).
- Separate per-user state files (`{user}_egreedy.json`), independent of LinUCB state.

**Simulation setup (rule judge, seed=42):**

- 5 synthetic personas × 20 rounds = 100 judgments (pulls) per policy, with 8 candidate tasks per round.
- Reward inferred from dwell time (same `inferReward` logic as production): dismiss=−1, snooze=+0.1, done<15 s=−0.3, done 15 s–2 min=+1.0, done 2–10 min=+0.6, done>10 min=+0.3.
- Both policies started from blank state (no warm-up).

**Results:**

| Policy | Total reward | Mean reward/pull | Pulls |
|--------|--------------|------------------|-------|
| egreedy-v1 | −54.80 | −0.548 | 100 |
| linucb-v1 | −60.60 | −0.606 | 100 |

Winner: **egreedy-v1** (mean reward higher by 0.058 per pull, a loss about 9.6% smaller than LinUCB's).

Both policies produce negative mean rewards under the dwell-time model. This is expected: on cold models, most simulated users do not complete a task inside the 15 s–2 min window that earns the +1.0 reward. The gap widens from round 8 onward, consistent with LinUCB's UCB exploration bonus over-favouring high-uncertainty dimensions (`is_overdue`, `task_age_days`) regardless of persona fit.

## Decision

Promote **egreedy-v1** to the active serving policy:

- `POST /recommend` calls `/score/egreedy` instead of `/score`.
- The feedback loop calls `/reward/egreedy`.
- LinUCB (`/score`, `/reward`) remains deployed in `ml/serving` as a shadow-eligible fallback.

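
For reference, the ε-greedy v1 design described under Context can be sketched roughly as follows. This is a minimal Python illustration, not the code in `ml/serving`: the names `features` and `EpsilonGreedyRidge`, the feature order and scaling, the 0–6 `day_of_week` convention, and the ridge penalty λ=1.0 are all assumptions. The sketch keeps the inverse ridge matrix current with a Sherman-Morrison rank-one update so it needs no linear solver.

```python
import math
import random

D = 7  # 5 base features + sin/cos of day_of_week


def features(is_overdue, task_age_days, priority, hour_of_day, day_of_week):
    """Build the d=7 vector. Order, scaling and 0-6 weekdays are assumptions."""
    angle = 2.0 * math.pi * day_of_week / 7.0
    return [
        1.0 if is_overdue else 0.0,
        float(task_age_days),
        float(priority),
        float(hour_of_day),
        1.0,  # bias
        math.sin(angle),
        math.cos(angle),
    ]


class EpsilonGreedyRidge:
    """Hypothetical online ridge regression with epsilon-greedy selection."""

    def __init__(self, d=D, epsilon=0.10, ridge_lambda=1.0, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        # A = lambda * I initially, so A^-1 = I / lambda.
        self.a_inv = [
            [1.0 / ridge_lambda if i == j else 0.0 for j in range(d)]
            for i in range(d)
        ]
        self.b = [0.0] * d

    def theta(self):
        # theta = A^-1 b
        return [sum(a * bj for a, bj in zip(row, self.b)) for row in self.a_inv]

    def select(self, candidates):
        """Return the index of the chosen candidate feature vector."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(candidates))  # explore
        theta = self.theta()
        scores = [sum(t * x for t, x in zip(theta, c)) for c in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)  # exploit

    def update(self, x, reward):
        """Sherman-Morrison update of A^-1 for A += x x^T, then b += r x."""
        ax = [sum(a * xj for a, xj in zip(row, x)) for row in self.a_inv]
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, ax))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]
```

Setting `epsilon=0.0` turns the sketch into a purely greedy ridge policy, which is convenient for deterministic checks; the 10% exploration rate is the only difference from that baseline.
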
The simulation does not replace online A/B testing; it is evidence that egreedy-v1 is worth promoting before collecting real-user signal. A future milestone will run a live A/B test once we have enough daily active users for statistical power.

## Consequences

- Recommendation calls and reward updates now hit the egreedy endpoints only.
- LinUCB state is preserved on disk; re-activation is a one-line change.
- `tip_scores.policy` will log `egreedy-v1` for new serves; historical rows remain `linucb-v1` or `random`.
- The dwell-time reward model (`inferReward`) is now the canonical feedback signal for both online updates and simulation. Explicit helpful/not\_helpful signals are removed.
- Next evaluation gate: once ≥500 real tips have been served with egreedy-v1, compare the reward distribution to the LinUCB historical baseline in the admin Reward Analytics page before deciding on the next policy iteration.
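
Because the dwell-time mapping is now the single feedback signal, it may help to see it spelled out. The sketch below is a Python restatement of the reward values listed under Simulation setup, not the production `inferReward` code. The function name `infer_reward`, the action strings, and the handling of the exact boundaries (15 s and 2 min counted in the +1.0 band, 10 min in the +0.6 band) are assumptions, since the ADR does not say which side of each boundary is inclusive.

```python
def infer_reward(action: str, dwell_seconds: float = 0.0) -> float:
    """Map a user action and dwell time to a reward (hypothetical sketch)."""
    if action == "dismiss":
        return -1.0
    if action == "snooze":
        return 0.1
    if action == "done":
        if dwell_seconds < 15:
            return -0.3  # done<15 s
        if dwell_seconds <= 120:
            return 1.0   # done 15 s-2 min (boundary handling assumed)
        if dwell_seconds <= 600:
            return 0.6   # done 2-10 min (boundary handling assumed)
        return 0.3       # done>10 min
    raise ValueError(f"unknown action: {action!r}")
```

For example, `infer_reward("done", 60)` falls in the 15 s–2 min band and yields the maximum reward, while `infer_reward("dismiss")` yields the minimum.
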