- Promote egreedy-v1 to active serving policy (ADR-0007): /score/egreedy + /reward/egreedy
replace linucb-v1 endpoints after offline sim shows +10.7% mean reward (−0.548 vs −0.606)
- Replace explicit helpful/not_helpful feedback with dwell-time inferred reward (inferReward):
dismiss=−1.0, snooze=+0.1, done<15s=−0.3, done 15s–2min=+1.0, done 2–10min=+0.6, done>10min=+0.3
- Add ml/serving ε-greedy endpoints: /score/egreedy, /reward/egreedy, /stats/egreedy/{user_id}
with d=7 feature vector (base 5 + sin/cos day-of-week encoding)
- Add offline simulation framework (ml/experiments/sim): rule/LLM/claude-code judges,
two-phase score+reward, synthetic personas, task generator; results stored in sim_runs/sim_events
- Add /admin/simulations page: start runs, live-poll status, reward curve SVG, action/persona tables
- Fix egreedy day_of_week training skew: reward endpoint now uses actual dow instead of hardcoded 0
- Fix runner.py proxy bypass: httpx.Client(trust_env=False) for localhost ML calls
- Add dwellMs to TipFeedbackEvent contract and bus.test.ts fixture
- Schema: sim_runs, sim_events tables; tip_feedback gains dwell_ms, reward_milli columns
- ADR-0006: admin console framework; ADR-0007: egreedy-v1 policy selection rationale
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ADR-0007: ε-greedy v1 as the active recommendation policy
Status
Accepted — 2026-04-16
Context
M1 shipped LinUCB (d=5, α=1.0) as the first learned policy via ml/serving /score. After the M1 admin console landed, we ran an offline simulation to compare LinUCB against a new ε-greedy ridge-regression policy before deciding which to keep live.
ε-greedy v1 design:
- Ridge regression estimator, θ updated online (equivalent to LinUCB without the UCB bonus).
- d=7 feature vector: base 5 (is_overdue, task_age_days, priority, hour_of_day, bias) + sin/cos encoding of day_of_week.
- ε=0.10 random exploration; 90% argmax(θ·x).
- Separate per-user state files (`{user}_egreedy.json`), independent of LinUCB state.
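
The design bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `ml/serving`: the class name `EGreedyRidge`, the `features` helper, the feature order, the ridge strength `lam=1.0`, and the lack of feature scaling are all assumptions made for the example. It keeps the inverse of the regularised design matrix up to date with a Sherman-Morrison rank-one update, so θ can be read off without a matrix solve.

```python
import math
import random

D = 7  # base 5 features + sin/cos encoding of day_of_week


def features(is_overdue, task_age_days, priority, hour_of_day, day_of_week):
    """Build the d=7 vector. Order and (absent) scaling are assumptions."""
    angle = 2.0 * math.pi * day_of_week / 7.0
    return [float(is_overdue), float(task_age_days), float(priority),
            float(hour_of_day), 1.0,  # bias term
            math.sin(angle), math.cos(angle)]


class EGreedyRidge:
    """Hypothetical epsilon-greedy policy over an online ridge estimate."""

    def __init__(self, d=D, epsilon=0.10, lam=1.0, seed=None):
        self.epsilon = epsilon
        # a_inv tracks (lam*I + sum of x x^T)^-1; b tracks sum of r*x.
        self.a_inv = [[(1.0 / lam) if i == j else 0.0 for j in range(d)]
                      for i in range(d)]
        self.b = [0.0] * d
        self.rng = random.Random(seed)

    def theta(self):
        """Ridge solution theta = a_inv @ b."""
        return [sum(a * bj for a, bj in zip(row, self.b)) for row in self.a_inv]

    def score(self, candidates):
        """Return the index of the chosen candidate feature vector."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(candidates))  # explore
        th = self.theta()
        values = [sum(t * x for t, x in zip(th, c)) for c in candidates]
        return max(range(len(candidates)), key=values.__getitem__)  # exploit

    def reward(self, x, r):
        """Sherman-Morrison update of a_inv for A += x x^T, then b += r*x."""
        ax = [sum(a * xj for a, xj in zip(row, x)) for row in self.a_inv]
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, ax))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        self.b = [bi + r * xi for bi, xi in zip(self.b, x)]
```

Setting `epsilon=0.0` turns the policy into pure argmax(θ·x), which is convenient for deterministic tests.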
Simulation setup (rule judge, seed=42):
- 5 synthetic personas × 20 rounds = 100 judgments per policy, with 8 candidate tasks per round.
- Reward inferred from dwell-time (same `inferReward` logic as production): dismiss=−1, snooze=+0.1, done<15 s=−0.3, done 15 s–2 min=+1.0, done 2–10 min=+0.6, done>10 min=+0.3.
- Both policies started from blank state (no warm-up).
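
The reward mapping listed above can be written as a small lookup. The sketch below is a Python rendering for illustration only; the function name `infer_reward`, its signature, and the handling of the exact boundaries (whether 2 min and 10 min fall in the lower or upper bucket) are assumptions, since the ADR gives only the ranges.

```python
def infer_reward(action: str, dwell_ms: int) -> float:
    """Map a feedback action and dwell time to a reward.

    Boundary handling is an assumption: 15 s and 2 min are treated as
    part of the +1.0 bucket, 10 min as part of the +0.6 bucket.
    """
    if action == "dismiss":
        return -1.0
    if action == "snooze":
        return 0.1
    if action != "done":
        raise ValueError(f"unknown action: {action!r}")

    seconds = dwell_ms / 1000.0
    if seconds < 15:
        return -0.3  # completed too quickly to reflect real engagement
    if seconds <= 120:
        return 1.0   # 15 s to 2 min: highest reward
    if seconds <= 600:
        return 0.6   # 2 to 10 min
    return 0.3       # more than 10 min
```

For example, `infer_reward("done", 45_000)` falls in the 15 s–2 min bucket, while `infer_reward("dismiss", 0)` ignores the dwell time entirely.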
Results:
| Policy | Total reward | Mean reward/pull | Pulls |
|---|---|---|---|
| egreedy-v1 | −54.80 | −0.548 | 100 |
| linucb-v1 | −60.60 | −0.606 | 100 |
Winner: egreedy-v1 (+10.7% mean reward).
Both policies produce negative mean rewards under the dwell-time model. This is expected: with cold models, most simulated users do not complete tasks inside the 15 s–2 min window that earns +1.0. The gap widens from round 8 onward, consistent with LinUCB's UCB exploration bonus over-favouring high-uncertainty dimensions (is_overdue, task_age_days) regardless of persona fit.
Decision
Promote egreedy-v1 to the active serving policy:
- `POST /recommend` calls `/score/egreedy` instead of `/score`.
- Feedback loop calls `/reward/egreedy`.
- LinUCB (`/score`, `/reward`) remains deployed in `ml/serving` as a shadow-eligible fallback.
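
One way to picture the routing described above is a small policy-to-endpoint table with a single active-policy switch. This is a hypothetical sketch: `POLICY_ENDPOINTS`, `ACTIVE_POLICY`, and `endpoint` are illustrative names, and the real service may wire this differently. Only the paths and policy names come from this ADR.

```python
# Paths and policy names are from the ADR; the table itself is an assumption.
POLICY_ENDPOINTS = {
    "egreedy-v1": {"score": "/score/egreedy", "reward": "/reward/egreedy"},
    "linucb-v1": {"score": "/score", "reward": "/reward"},
}

# Falling back to LinUCB would mean changing only this value.
ACTIVE_POLICY = "egreedy-v1"


def endpoint(kind: str, policy: str = ACTIVE_POLICY) -> str:
    """Return the ml/serving path for 'score' or 'reward' under a policy."""
    return POLICY_ENDPOINTS[policy][kind]
```

Keeping both policies in the table reflects that LinUCB stays deployed and can be addressed explicitly, for example `endpoint("score", "linucb-v1")` for a shadow call.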
The simulation does not replace online A/B testing; it is evidence that egreedy-v1 is worth promoting before collecting real-user signal. A future milestone will run live A/B once we have enough daily active users for statistical power.
Consequences
- Recommendation calls and reward updates now hit the egreedy endpoints only.
- LinUCB state is preserved on disk; re-activation is a one-line change.
- `tip_scores.policy` will log `egreedy-v1` for new serves; historical rows remain `linucb-v1` or `random`.
- The dwell-time reward model (`inferReward`) is now the canonical feedback signal for both online updates and simulation. Explicit helpful/not_helpful signals are removed.
- Next evaluation gate: once ≥500 real tips have been served with egreedy-v1, compare the reward distribution to the LinUCB historical baseline in the admin Reward Analytics page before deciding on the next policy iteration.