oO/docs/adr/0007-egreedy-v1-active-policy.md
alvis 37aec4fee1 chore: ADR-0007/0012 superseded status + admin users ID column
ADR-0007 and ADR-0012 both superseded by ADR-0013 as of 2026-05-01.
UsersTable gains a truncated ID column for quick user identification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 10:20:44 +00:00

ADR-0007: ε-greedy v1 as the active recommendation policy

Status

Superseded by ADR-0013 — 2026-05-01

Context

M1 shipped LinUCB (d=5, α=1.0) as the first learned policy via ml/serving /score. After the M1 admin console landed, we ran an offline simulation to compare LinUCB against a new ε-greedy ridge-regression policy before deciding which to keep live.

ε-greedy v1 design:

  • Ridge regression estimator, θ updated online (equivalent to LinUCB without the UCB bonus).
  • d=7 feature vector: base 5 (is_overdue, task_age_days, priority, hour_of_day, bias) + sin/cos encoding of day_of_week.
  • ε=0.10: with probability 0.10 the policy explores by serving a random task; the remaining 90% of the time it serves argmax(θ·x).
  • Separate per-user state files ({user}_egreedy.json), independent of LinUCB state.
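
The design above can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the code in ml/serving: the names `features` and `EGreedyRidge`, the ridge strength `lam=1.0`, the unscaled raw features, and the 0-6 numbering of day_of_week are all hypothetical.

```python
import math
import random

D = 7  # 5 base features + sin/cos encoding of day_of_week


def features(is_overdue, task_age_days, priority, hour_of_day, day_of_week):
    """Build the d=7 vector. Raw, unscaled values are an assumption."""
    angle = 2.0 * math.pi * day_of_week / 7.0  # day_of_week assumed in 0..6
    return [
        1.0 if is_overdue else 0.0,
        float(task_age_days),
        float(priority),
        float(hour_of_day),
        1.0,  # bias
        math.sin(angle),
        math.cos(angle),
    ]


class EGreedyRidge:
    """Online ridge regression with epsilon-greedy selection (no UCB bonus)."""

    def __init__(self, d=D, epsilon=0.10, lam=1.0, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        # A_inv starts as (lam * I)^-1; b accumulates reward-weighted features.
        self.A_inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)]
                      for i in range(d)]
        self.b = [0.0] * d

    def theta(self):
        # theta = A^-1 b
        return [sum(a * bj for a, bj in zip(row, self.b)) for row in self.A_inv]

    def select(self, candidates):
        """Return the index of the chosen candidate feature vector."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(candidates))  # explore
        th = self.theta()
        scores = [sum(t * x for t, x in zip(th, xs)) for xs in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)  # exploit

    def update(self, x, reward):
        # Sherman-Morrison rank-1 update of A^-1 after A += x x^T.
        Ax = [sum(a * xj for a, xj in zip(row, x)) for row in self.A_inv]
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, Ax))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
```

Keeping the inverse up to date with a rank-1 update avoids re-solving the linear system on every reward, which is cheap at d=7. Setting `epsilon=0.0` gives the pure greedy ridge scorer.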

Simulation setup (rule judge, seed=42):

  • 5 synthetic personas × 20 rounds = 100 judgments (pulls) per policy, with 8 candidate tasks per round.
  • Reward inferred from dwell time (same inferReward logic as production): dismiss=−1, snooze=+0.1, done<15 s=−0.3, done 15 s–2 min=+1.0, done 2–10 min=+0.6, done>10 min=+0.3.
  • Both policies started from blank state (no warm-up).
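
The dwell-time reward mapping can be written as a small function. This is a hypothetical Python mirror of inferReward, not the production code. It assumes the dismiss and done<15 s values are negative (consistent with the negative mean rewards reported in the results), and the handling of the exact 15 s, 2 min, and 10 min boundaries is a guess.

```python
def infer_reward(action, dwell_seconds=0.0):
    """Map a user action and dwell time (seconds) to a scalar reward."""
    if action == "dismiss":
        return -1.0
    if action == "snooze":
        return 0.1
    if action == "done":
        if dwell_seconds < 15:      # completed too fast to be a real action
            return -0.3
        if dwell_seconds <= 120:    # 15 s to 2 min: highest reward
            return 1.0
        if dwell_seconds <= 600:    # 2 to 10 min
            return 0.6
        return 0.3                  # more than 10 min
    raise ValueError(f"unknown action: {action!r}")
```

For example, `infer_reward("done", 45)` falls in the 15 s to 2 min band and yields 1.0, while `infer_reward("done", 5)` is penalised.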

Results:

| Policy     | Total reward | Mean reward/pull | Pulls |
|------------|--------------|------------------|-------|
| egreedy-v1 | −54.80       | −0.548           | 100   |
| linucb-v1  | −60.60       | −0.606           | 100   |

Winner: egreedy-v1 (+10.7% mean reward).

Both policies produce negative mean rewards under the dwell-time model. This is expected: on cold models, most simulated users don't act within the 15 s–2 min window that earns the highest reward. The gap widens from round 8 onward, consistent with LinUCB's UCB exploration bonus over-favouring high-uncertainty dimensions (is_overdue, task_age_days) regardless of persona fit.
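
Because both mean rewards are negative, "better" here means "less negative". The arithmetic can be checked with a short sketch using the rounded totals from the results, taken as negative. The ADR does not state which denominator the reported relative gain uses, so both conventions are shown; the variable names are illustrative.

```python
def mean_reward(total, pulls):
    """Mean reward per pull."""
    return total / pulls


egreedy = mean_reward(-54.80, 100)   # -0.548
linucb = mean_reward(-60.60, 100)    # -0.606

# Positive gap: egreedy-v1 loses less reward per pull than linucb-v1.
gap = egreedy - linucb

# Relative gap under two possible conventions.
rel_to_linucb = gap / abs(linucb)    # relative to the baseline policy
rel_to_egreedy = gap / abs(egreedy)  # relative to the winning policy

print(f"gap={gap:.3f} vs_linucb={rel_to_linucb:.1%} vs_egreedy={rel_to_egreedy:.1%}")
```

With these rounded inputs the two conventions come out near 9.6% and 10.6%; a small difference from the reported figure could come from rounding in the published totals.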

Decision

Promote egreedy-v1 to the active serving policy:

  • POST /recommend calls /score/egreedy instead of /score.
  • Feedback loop calls /reward/egreedy.
  • LinUCB (/score, /reward) remains deployed in ml/serving as a shadow-eligible fallback.
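
One way to picture the switch is a single active-policy setting that resolves to the score and reward endpoints. The sketch below is an assumption about how such routing could look; `POLICY_ENDPOINTS`, `ACTIVE_POLICY`, and `endpoints_for` are hypothetical names, and only the endpoint paths and policy labels come from this ADR.

```python
# Endpoint paths are from this ADR; the mapping itself is illustrative.
POLICY_ENDPOINTS = {
    "egreedy-v1": {"score": "/score/egreedy", "reward": "/reward/egreedy"},
    "linucb-v1": {"score": "/score", "reward": "/reward"},
}

# Re-activating LinUCB would mean changing this one value.
ACTIVE_POLICY = "egreedy-v1"


def endpoints_for(policy=None):
    """Return the score/reward paths for a policy (defaults to the active one)."""
    name = ACTIVE_POLICY if policy is None else policy
    try:
        return POLICY_ENDPOINTS[name]
    except KeyError:
        raise ValueError(f"unknown policy: {name!r}") from None
```

Keeping both entries in the table reflects that LinUCB stays deployed as a fallback while only the active entry receives traffic.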

The simulation does not replace online A/B testing; it is evidence that egreedy-v1 is worth promoting before collecting real-user signal. A future milestone will run live A/B once we have enough daily active users for statistical power.

Consequences

  • Recommendation calls and reward updates now hit the egreedy endpoints only.
  • LinUCB state is preserved on disk; re-activation is a one-line change.
  • tip_scores.policy will log egreedy-v1 for new serves; historical rows remain linucb-v1 or random.
  • The dwell-time reward model (inferReward) is now the canonical feedback signal for both online updates and simulation. Explicit helpful/not_helpful signals are removed.
  • Next evaluation gate: once ≥500 real tips have been served with egreedy-v1, compare its reward distribution to the LinUCB historical baseline in the admin Reward Analytics page before deciding on the next policy iteration.
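
The evaluation gate can be expressed as a simple check over served-tip rows. This is a sketch only: the row shape (dicts with `policy` and `reward` keys) is an assumption about what tip_scores exposes, `evaluation_gate` is a hypothetical helper, and comparing means is a simplification of comparing full reward distributions.

```python
from statistics import mean

MIN_SERVES = 500  # threshold from the evaluation gate above


def evaluation_gate(rows, min_serves=MIN_SERVES):
    """Summarise rewards per policy and report whether the gate is reached."""
    by_policy = {}
    for row in rows:
        by_policy.setdefault(row["policy"], []).append(row["reward"])

    egreedy = by_policy.get("egreedy-v1", [])
    baseline = by_policy.get("linucb-v1", [])
    return {
        "ready": len(egreedy) >= min_serves,
        "egreedy_serves": len(egreedy),
        "egreedy_mean": mean(egreedy) if egreedy else None,
        "linucb_mean": mean(baseline) if baseline else None,
    }
```

Rows logged under the `random` policy are grouped but ignored by the summary, so they do not count toward the threshold.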