ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)
Status: Superseded by ADR-0013 — 2026-05-01
Date: 2026-04-25 (accepted) / 2026-04-26 (promoted)
Issue: #99
Context
ADR-0011 shipped a 5-feature user-profile registry (completion rate, dismiss rate,
mean dwell, preferred hour, tip volume). POST /score and POST /score/egreedy
already receive a profile_features dict on every call but ignore it — the
comment in ml/serving/main.py explains why: extending the feature vector changes
D, which resets every user's learned A/b matrices and discards accumulated
signal. That loss requires a deliberate shadow-first rollout per ADR-0002, not an
in-place update.
This ADR authorises egreedy-v2, which extends the active egreedy-v1 (D=7) with
the 5 profile features (D=12) and defines how it ships safely.
Decision
New policy: egreedy-v2 (D=12)
Feature vector layout:
| idx | name | encoding |
|---|---|---|
| 0–1 | hour_sin, hour_cos | cyclical, current hour |
| 2 | is_overdue | 0/1 |
| 3 | task_age_norm | age_days / 30, clipped 0–1 |
| 4 | priority_norm | (p − 1) / 3 |
| 5–6 | dow_sin, dow_cos | cyclical, day of week |
| 7 | completion_rate_30d | raw (already 0–1); null → 0 |
| 8 | dismiss_rate_30d | raw (already 0–1); null → 0 |
| 9 | mean_dwell_norm | dwell_ms / 600_000, clipped 0–1; null → 0 |
| 10 | preferred_hour_alignment | (cos(2π(pref − now)/24) + 1) / 2; null → 0.5 (neutral) |
| 11 | tip_volume_norm | log1p(n) / log1p(100), clipped 0–1; null → 0 |
Normalization rationale:
- Rates are already in [0, 1]; no transform needed.
- Dwell clips at 10 min — anything beyond that carries diminishing signal.
- preferred_hour needs circular continuity; we use a one-dimensional approximation based on cosine alignment with the current hour. At null (no established peak) we use 0.5 (the midpoint/neutral) rather than 0 (misleading "polar-opposite hour").
- tip_volume uses log-scale because engagement counts are heavy-tailed.
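The layout and normalization rules above can be sketched as a single function. This is a minimal illustration, not the code in ml/serving/main.py: the function name, the argument names, and the profile dict keys (mean_dwell_ms, preferred_hour, tip_volume) are hypothetical, and it assumes day of week is an integer 0–6 encoded with period 7 and priority is an integer 1–4.

```python
import math
from typing import List, Mapping, Optional


def _clip01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def build_features_v2(
    hour: float,
    dow: int,
    is_overdue: bool,
    age_days: float,
    priority: int,
    profile: Mapping[str, Optional[float]],
) -> List[float]:
    """Build the 12-dim vector in the index order of the table above.

    Assumed (not from the ADR): the profile key names, dow in 0..6,
    priority in 1..4.
    """
    hour_angle = 2 * math.pi * hour / 24
    dow_angle = 2 * math.pi * dow / 7

    completion = profile.get("completion_rate_30d")
    dismiss = profile.get("dismiss_rate_30d")
    dwell_ms = profile.get("mean_dwell_ms")
    pref = profile.get("preferred_hour")
    tips = profile.get("tip_volume")

    # Null preferred hour maps to the neutral midpoint, not to 0.
    if pref is None:
        alignment = 0.5
    else:
        alignment = (math.cos(2 * math.pi * (pref - hour) / 24) + 1) / 2

    return [
        math.sin(hour_angle),                                   # 0
        math.cos(hour_angle),                                   # 1
        1.0 if is_overdue else 0.0,                             # 2
        _clip01(age_days / 30),                                 # 3
        (priority - 1) / 3,                                     # 4
        math.sin(dow_angle),                                    # 5
        math.cos(dow_angle),                                    # 6
        0.0 if completion is None else completion,              # 7
        0.0 if dismiss is None else dismiss,                    # 8
        0.0 if dwell_ms is None else _clip01(dwell_ms / 600_000),  # 9
        alignment,                                              # 10
        0.0 if tips is None
        else _clip01(math.log1p(tips) / math.log1p(100)),       # 11
    ]
```

With an empty profile, indices 7–9 and 11 fall back to 0 and index 10 falls back to 0.5, so a user with no history contributes no profile signal beyond the neutral alignment value.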
Rollout sequence (per ADR-0002)
1. Shadow (this ADR): egreedy-v2-shadow is registered in the recommender's shadow-policy map (disabled by default). Admin enables via /admin/policies.
   - Calls /score/egreedy/v2 fire-and-forget alongside the active egreedy-v1 call.
   - Publishes signals.tip.served with policy: shadow:egreedy-v2-shadow for logging.
   - No reward delivery to shadow: live shadow collects decision-agreement exposure only; reward measurement uses offline simulation.
   - State files: {user}_egreedy_v2.json, isolated from v1's {user}_egreedy.json.
2. Offline sim: run runner.py --policies egreedy-v1 egreedy-v2 --n-rounds 20 using the rule judge and persona-level profile features (synthetic values in personas.py). Gate: v2 mean reward ≥ v1 mean reward.
3. Promote: if the sim gate passes, change the remotePolicy() call in recommender.ts from /score/egreedy to /score/egreedy/v2 and change reward delivery to /reward/egreedy/v2. No DB migration; old per-user v1 state files are left on disk (available for rollback; clean up after 30 days).
State-file migration
No migration of A/b matrices from v1 → v2. A D×D→D'×D' transform would
require assumptions about the new dimensions that we cannot justify without data.
v2 starts from the identity prior and learns from scratch in shadow/sim. The reward
penalty from cold-start is the correct price for the dimension extension.
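A fresh v2 state under the identity prior might look like the sketch below. The JSON field names and helper names are assumptions for illustration; only the D=12 dimension and the {user}_egreedy_v2.json file name come from this ADR. It also assumes the usual ridge-regression form in which theta solves A·theta = b, so a fresh state (A = I, b = 0) gives theta = 0 for every dimension.

```python
import json
from typing import Dict, List

D_V2 = 12  # 7 candidate-level features + 5 profile features


def new_state(d: int = D_V2) -> Dict:
    """Identity prior: A = I (d x d), b = 0 (d). Field names are hypothetical."""
    return {
        "d": d,
        "A": [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)],
        "b": [0.0] * d,
    }


def state_filename(user: str) -> str:
    """v2 state lives in its own file, separate from v1's {user}_egreedy.json."""
    return f"{user}_egreedy_v2.json"


def update(state: Dict, x: List[float], reward: float) -> None:
    """Standard linear-bandit update: A += x xᵀ, b += reward * x."""
    d = state["d"]
    if len(x) != d:
        raise ValueError(f"expected {d} features, got {len(x)}")
    for i in range(d):
        state["b"][i] += reward * x[i]
        for j in range(d):
            state["A"][i][j] += x[i] * x[j]


if __name__ == "__main__":
    state = new_state()
    print(state_filename("u1"), json.dumps(state)[:60], "...")
```

Because the dimension check rejects any vector that is not length 12, a v1 (D=7) feature vector cannot be applied to a v2 state by accident, which is the practical reason the two state files are kept apart.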
Admin control
GET /api/admin/policies surfaces egreedy-v2-shadow with active: false.
Toggle via POST /api/admin/policies/egreedy-v2-shadow/toggle.
Consequences
Good:
- Profile features (preferred hour, completion/dismiss rates, volume) allow the bandit to personalise timing recommendations beyond what the candidate-level features encode.
- Normalization is deterministic, bounded [0, 1], and numerically stable; no scaling artefacts as the population grows.
- Shadow-first rollout protects real users from a cold-start regression.
Trade-offs:
- Cold-start: v2 state files begin from the identity prior. During shadow, v2 makes near-random decisions for early users. This is expected and intentional.
- Synthetic persona profiles in personas.py approximate real user distributions; the offline sim is evidence, not proof. The promotion gate requires the sim to run after v2 has accumulated enough behavioral data (suggest ≥100 shadow calls per policy per user before running the final sim).
- The one-dim preferred-hour encoding loses some circular information compared to two-dim sin/cos. If preferred-hour alignment becomes a dominant signal, revisit with D=13 in a follow-up ADR.
Alternatives considered
Warm-start via projection — project v1's 7-dim theta into D=12 by padding
with zeros. Rejected: zero initialization for the profile dims is equivalent, and
projecting theta without the corresponding A matrix cannot be done correctly.
D=13 with two preferred-hour dims — cleaner circular encoding, but contradicts the D=12 target in the issue spec and complicates the sim comparison. Deferred.
In-place v1 promotion without shadow — violates ADR-0002.
Promotion record (2026-04-26)
Offline sim (runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42):
| policy | total reward | mean reward | pulls |
|---|---|---|---|
| egreedy-v1 | −64.20 | −0.6420 | 100 |
| egreedy-v2 | −62.90 | −0.6290 | 100 |
Gate passed (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.
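The gate arithmetic can be checked directly from the table. This is a small sketch with hypothetical helper names, not part of runner.py; it only restates the rule "v2 mean reward ≥ v1 mean reward" using the recorded totals and pull counts.

```python
def mean_reward(total_reward: float, pulls: int) -> float:
    """Mean reward per pull."""
    if pulls <= 0:
        raise ValueError("pulls must be positive")
    return total_reward / pulls


def gate_passes(v1_total: float, v1_pulls: int,
                v2_total: float, v2_pulls: int) -> bool:
    """Promotion gate from this ADR: v2 mean reward >= v1 mean reward."""
    return mean_reward(v2_total, v2_pulls) >= mean_reward(v1_total, v1_pulls)


if __name__ == "__main__":
    # Values from the promotion record table (5 users x 20 rounds = 100 pulls).
    print(gate_passes(-64.20, 100, -62.90, 100))
```

Both means are negative, so the gate passes because v2 is less negative than v1, a margin of 0.013 per pull on this seed.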
Changes applied:
- recommender.ts remotePolicy(): /score/egreedy → /score/egreedy/v2
- recommender.ts sendRewardWithRetry(): /reward/egreedy → /reward/egreedy/v2, added profile_features to payload
- Shadow entry egreedy-v2-shadow left in registry (active: false) for rollback.