# ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)

**Status:** Promoted
**Date:** 2026-04-25 (accepted) / 2026-04-26 (promoted)
**Issue:** #99

## Context

ADR-0011 shipped a 5-feature user-profile registry (completion rate, dismiss rate, mean dwell, preferred hour, tip volume). `POST /score` and `POST /score/egreedy` already receive a `profile_features` dict on every call but **ignore it** — the comment in `ml/serving/main.py` explains why: extending the feature vector changes `D`, which resets every user's learned `A`/`b` matrices and discards accumulated signal. That loss requires a deliberate shadow-first rollout per ADR-0002, not an in-place update.

This ADR authorises `egreedy-v2`, which extends the active `egreedy-v1` (D=7) with the 5 profile features (D=12) and defines how it ships safely.

## Decision

### New policy: egreedy-v2 (D=12)

Feature vector layout:

| idx | name | encoding |
|-----|------|----------|
| 0–1 | hour_sin, hour_cos | cyclical, current hour |
| 2 | is_overdue | 0/1 |
| 3 | task_age_norm | age_days / 30, clipped 0–1 |
| 4 | priority_norm | (p − 1) / 3 |
| 5–6 | dow_sin, dow_cos | cyclical, day of week |
| 7 | completion_rate_30d | raw (already 0–1); null → 0 |
| 8 | dismiss_rate_30d | raw (already 0–1); null → 0 |
| 9 | mean_dwell_norm | dwell_ms / 600_000, clipped 0–1; null → 0 |
| 10 | preferred_hour_alignment | `(cos(2π(pref − now)/24) + 1) / 2`; null → 0.5 (neutral) |
| 11 | tip_volume_norm | `log1p(n) / log1p(100)`, clipped 0–1; null → 0 |

**Normalization rationale:**

- Rates are already in [0, 1]; no transform needed.
- Dwell clips at 10 min — anything beyond that carries diminishing signal.
- `preferred_hour` needs circular continuity; one-dimension approximation using cosine alignment with the current hour. At null (no established peak) we use 0.5 (the midpoint/neutral) rather than 0 (misleading "polar-opposite hour").
- `tip_volume` uses log-scale because engagement counts are heavy-tailed.

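The profile rows of the table (indices 7–11) can be sketched in Python as follows. This is a minimal illustration of the stated encodings, not the code in `ml/serving/main.py`: the function name `encode_profile_features`, the helper `clip01`, and the parameter names are hypothetical, and the real keys of the `profile_features` dict are not specified in this ADR.

```python
import math
from typing import List, Optional


def clip01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def encode_profile_features(
    now_hour: float,
    completion_rate_30d: Optional[float],
    dismiss_rate_30d: Optional[float],
    mean_dwell_ms: Optional[float],
    preferred_hour: Optional[float],
    tip_volume: Optional[int],
) -> List[float]:
    """Return indices 7-11 of the D=12 vector (hypothetical helper)."""
    # Rates are already in [0, 1]; null maps to 0.
    completion = 0.0 if completion_rate_30d is None else completion_rate_30d
    dismiss = 0.0 if dismiss_rate_30d is None else dismiss_rate_30d

    # Dwell is scaled by 10 minutes (600_000 ms) and clipped.
    dwell = 0.0 if mean_dwell_ms is None else clip01(mean_dwell_ms / 600_000)

    # Cosine alignment between preferred hour and current hour;
    # null maps to the neutral midpoint 0.5, not 0.
    if preferred_hour is None:
        alignment = 0.5
    else:
        angle = 2 * math.pi * (preferred_hour - now_hour) / 24
        alignment = (math.cos(angle) + 1) / 2

    # Log scale for heavy-tailed counts, saturating at n = 100.
    if tip_volume is None:
        volume = 0.0
    else:
        volume = clip01(math.log1p(tip_volume) / math.log1p(100))

    return [completion, dismiss, dwell, alignment, volume]
```

Because the alignment depends only on the cosine of the hour difference, 23:00 versus 01:00 scores the same as 01:00 versus 23:00, which is the circular continuity the rationale asks for.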
### Rollout sequence (per ADR-0002)

1. **Shadow** (this ADR) — `egreedy-v2-shadow` registered in the recommender's shadow-policy map (disabled by default). Admin enables via `/admin/policies`.
   - Calls `/score/egreedy/v2` fire-and-forget alongside the active `egreedy-v1` call.
   - Publishes `signals.tip.served` with `policy: shadow:egreedy-v2-shadow` for logging.
   - **No reward delivery to shadow** — live shadow collects decision-agreement exposure only; reward measurement uses offline simulation.
   - State files: `{user}_egreedy_v2.json` — isolated from v1's `{user}_egreedy.json`.
2. **Offline sim** — run `runner.py --policies egreedy-v1 egreedy-v2 --n-rounds 20` using the `rule` judge and persona-level profile features (synthetic values in `personas.py`). Gate: v2 mean reward ≥ v1 mean reward.
3. **Promote** — if sim gate passes, change the `remotePolicy()` call in `recommender.ts` from `/score/egreedy` to `/score/egreedy/v2` and change reward delivery to `/reward/egreedy/v2`. No DB migration; old per-user v1 state files are left on disk (available for rollback; clean up after 30 days).

### State-file migration

No migration of `A`/`b` matrices from v1 → v2. A D×D→D'×D' transform would require assumptions about the new dimensions that we cannot justify without data. v2 starts from the identity prior and learns from scratch in shadow/sim. The reward penalty from cold-start is the correct price for the dimension extension.

### Admin control

`GET /api/admin/policies` surfaces `egreedy-v2-shadow` with `active: false`. Toggle via `POST /api/admin/policies/egreedy-v2-shadow/toggle`.

## Consequences

**Good:**

- Profile features (preferred hour, completion/dismiss rates, volume) allow the bandit to personalise timing recommendations beyond what the candidate-level features encode.
- Normalization is deterministic, bounded [0, 1], and numerically stable; no scaling artefacts as the population grows.
- Shadow-first rollout protects real users from a cold-start regression.

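The cold start referred to in these consequences follows from the identity prior chosen under State-file migration. The sketch below illustrates why a fresh v2 state cannot rank candidates. It assumes the per-user state is the usual ridge-regression pair with weights θ = A⁻¹b and rank-one updates, which this ADR implies but does not spell out; every function name here is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def identity_prior(d: int) -> Tuple[Matrix, Vector]:
    """Fresh state: A = I (d x d), b = 0 (d)."""
    A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    return A, [0.0] * d


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A @ theta = b by Gauss-Jordan elimination on a copy."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def update(A: Matrix, b: Vector, x: Vector, reward: float) -> None:
    """Assumed learning rule: A += x x^T, b += reward * x (in place)."""
    for i in range(len(x)):
        for j in range(len(x)):
            A[i][j] += x[i] * x[j]
        b[i] += reward * x[i]


def score(A: Matrix, b: Vector, x: Vector) -> float:
    """Predicted reward theta . x with theta = A^-1 b."""
    theta = solve(A, b)
    return sum(t * xi for t, xi in zip(theta, x))
```

With `A, b = identity_prior(12)` every candidate scores zero, so the greedy choice is an arbitrary tie-break until rewards arrive. Under these assumptions, a single `update` with a non-zero reward is enough to separate candidates that share features with the rewarded one.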
**Trade-offs:**

- Cold-start: v2 state files begin from the identity prior. During shadow, v2 makes random-ish decisions for early users. This is expected and intentional.
- Synthetic persona profiles in `personas.py` approximate real user distributions; the offline sim is evidence, not proof. The promotion gate requires the sim to run after v2 has accumulated enough behavioral data (suggest ≥100 shadow calls per policy per user before running the final sim).
- The one-dim preferred-hour encoding loses some circular information compared to two-dim sin/cos. If preferred-hour alignment becomes a dominant signal, revisit with D=13 in a follow-up ADR.

## Alternatives considered

**Warm-start via projection** — project v1's 7-dim theta into D=12 by padding with zeros. Rejected: zero initialization for the profile dims is equivalent, and projecting theta without the corresponding `A` matrix cannot be done correctly.

**D=13 with two preferred-hour dims** — cleaner circular encoding, but contradicts the D=12 target in the issue spec and complicates the sim comparison. Deferred.

**In-place v1 promotion without shadow** — violates ADR-0002.

## Promotion record (2026-04-26)

Offline sim (`runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42`):

| policy | total reward | mean reward | pulls |
|--------|-------------|-------------|-------|
| egreedy-v1 | −64.20 | −0.6420 | 100 |
| egreedy-v2 | −62.90 | −0.6290 | 100 |

**Gate passed** (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.

Changes applied:

- `recommender.ts` `remotePolicy()`: `/score/egreedy` → `/score/egreedy/v2`
- `recommender.ts` `sendRewardWithRetry()`: `/reward/egreedy` → `/reward/egreedy/v2`, added `profile_features` to payload
- Shadow entry `egreedy-v2-shadow` left in registry (`active: false`) for rollback.

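The gate arithmetic in the promotion record can be re-checked with a short sketch. It uses only the totals and pull counts from the table above (5 users × 20 rounds = 100 pulls per policy). `SimResult` and `gate_passes` are hypothetical names for illustration and do not describe how `runner.py` evaluates the gate.

```python
from dataclasses import dataclass


@dataclass
class SimResult:
    policy: str
    total_reward: float
    pulls: int

    @property
    def mean_reward(self) -> float:
        # Mean reward per pull, as reported in the table.
        return self.total_reward / self.pulls


def gate_passes(candidate: SimResult, baseline: SimResult) -> bool:
    """Promotion gate from the rollout sequence: candidate mean >= baseline mean."""
    return candidate.mean_reward >= baseline.mean_reward


# Figures copied from the promotion record.
v1 = SimResult("egreedy-v1", total_reward=-64.20, pulls=100)
v2 = SimResult("egreedy-v2", total_reward=-62.90, pulls=100)

if __name__ == "__main__":
    print(f"{v1.policy}: {v1.mean_reward:.4f}")
    print(f"{v2.policy}: {v2.mean_reward:.4f}")
    print("gate passed" if gate_passes(v2, v1) else "gate failed")
```

The margin is 0.013 mean reward per pull over 100 pulls, which is consistent with the ADR's framing of the sim as evidence rather than proof.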