feat(ml): egreedy-v2 shadow policy — D=12 with profile features (#99)

Ship the scaffolding for #99 (phase B.3 of #81):

- ml/serving: add /score/egreedy/v2, /reward/egreedy/v2, /stats/egreedy/v2
  endpoints (D=12). New feature dims: completion/dismiss rates, mean dwell
  (clipped 10min), preferred-hour alignment (cosine, 1-dim), tip volume (log).
  Separate state file per user (_egreedy_v2.json). /reset clears v2 state too.
- ADR-0012: documents D=7→12 dimension change, normalization choices, shadow
  rollout protocol, and promotion gate (offline sim win per ADR-0002).
- recommender.ts: register egreedy-v2-shadow in shadow-policy map (disabled by
  default). When enabled, calls /score/egreedy/v2 fire-and-forget and publishes
  shadow:egreedy-v2-shadow serve signal. No reward to shadow; sim is the gate.
- sim runner/personas: personas carry synthetic profile_features per persona;
  _call_score/_call_reward thread profile_features through (None-safe for
  v1/linucb).
- 18 new Python tests; all 56 Python + 170 TS tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:00:38 +00:00
parent b8113d4bda
commit 2d7cf217a9
6 changed files with 629 additions and 20 deletions
--- a/docs/adr/0012-egreedy-v2-profile-features.md
+++ b/docs/adr/0012-egreedy-v2-profile-features.md
@@ -0,0 +1,108 @@
+# ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)
+
+**Status:** Accepted  
+**Date:** 2026-04-25  
+**Issue:** #99
+
+## Context
+
+ADR-0011 shipped a 5-feature user-profile registry (completion rate, dismiss rate,
+mean dwell, preferred hour, tip volume). `POST /score` and `POST /score/egreedy`
+already receive a `profile_features` dict on every call but **ignore it** — the
+comment in `ml/serving/main.py` explains why: extending the feature vector changes
+`D`, which resets every user's learned `A`/`b` matrices and discards accumulated
+signal. That loss requires a deliberate shadow-first rollout per ADR-0002, not an
+in-place update.
+
+This ADR authorises `egreedy-v2`, which extends the active `egreedy-v1` (D=7) with
+the 5 profile features (D=12) and defines how it ships safely.
+
+## Decision
+
+### New policy: egreedy-v2 (D=12)
+
+Feature vector layout:
+
+| idx | name | encoding |
+|-----|------|----------|
+| 0–1 | hour_sin, hour_cos | cyclical, current hour |
+| 2 | is_overdue | 0/1 |
+| 3 | task_age_norm | age_days / 30, clipped 0–1 |
+| 4 | priority_norm | (p − 1) / 3 |
+| 5–6 | dow_sin, dow_cos | cyclical, day of week |
+| 7 | completion_rate_30d | raw (already 0–1); null → 0 |
+| 8 | dismiss_rate_30d | raw (already 0–1); null → 0 |
+| 9 | mean_dwell_norm | dwell_ms / 600_000, clipped 0–1; null → 0 |
+| 10 | preferred_hour_alignment | `(cos(2π(pref − now)/24) + 1) / 2`; null → 0.5 (neutral) |
+| 11 | tip_volume_norm | `log1p(n) / log1p(100)`, clipped 0–1; null → 0 |
+
+**Normalization rationale:**
+- Rates are already in [0, 1]; no transform needed.
+- Dwell clips at 10 min — anything beyond that carries diminishing signal.
+- `preferred_hour` needs circular continuity, so we use a one-dimensional
+  approximation: cosine alignment with the current hour. When it is null (no
+  established peak) we use 0.5 (the neutral midpoint) rather than 0, which
+  would misleadingly encode a "polar-opposite hour".
+- `tip_volume` uses log-scale because engagement counts are heavy-tailed.
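
A minimal sketch of the profile-feature encodings (indices 7 to 11) from the table above. The function name and the dict keys `mean_dwell_ms`, `preferred_hour`, and `tip_volume` are assumptions made for illustration; the actual code in `ml/serving/main.py` may be organised differently.

```python
import math


def encode_profile_features(profile, now_hour):
    """Return the five profile dims (idx 7-11), each bounded to [0, 1].

    `profile` is a dict whose values may be missing or None (hypothetical keys).
    """
    def clip01(x):
        return max(0.0, min(1.0, x))

    completion = profile.get("completion_rate_30d")
    dismiss = profile.get("dismiss_rate_30d")
    dwell_ms = profile.get("mean_dwell_ms")
    pref_hour = profile.get("preferred_hour")
    tips = profile.get("tip_volume")

    return [
        # idx 7, 8: rates are already in [0, 1]; null -> 0
        0.0 if completion is None else float(completion),
        0.0 if dismiss is None else float(dismiss),
        # idx 9: dwell clipped at 10 minutes (600_000 ms); null -> 0
        0.0 if dwell_ms is None else clip01(dwell_ms / 600_000),
        # idx 10: cosine alignment with the current hour; null -> 0.5 (neutral)
        0.5 if pref_hour is None
        else (math.cos(2 * math.pi * (pref_hour - now_hour) / 24) + 1) / 2,
        # idx 11: log-scaled tip volume, saturating at n = 100; null -> 0
        0.0 if tips is None else clip01(math.log1p(tips) / math.log1p(100)),
    ]
```

With an empty profile every dimension falls back to its null default, so only the alignment term is non-zero (0.5). A preferred hour equal to the current hour gives an alignment of 1.
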
+
+### Rollout sequence (per ADR-0002)
+
+1. **Shadow** (this ADR) — `egreedy-v2-shadow` registered in the recommender's
+   shadow-policy map (disabled by default). Admin enables via `/admin/policies`.
+   - Calls `/score/egreedy/v2` fire-and-forget alongside the active `egreedy-v1` call.
+   - Publishes `signals.tip.served` with `policy: shadow:egreedy-v2-shadow` for logging.
+   - **No reward delivery to shadow** — live shadow collects decision-agreement
+     exposure only; reward measurement uses offline simulation.
+   - State files: `{user}_egreedy_v2.json` — isolated from v1's `{user}_egreedy.json`.
+
+2. **Offline sim** — run `runner.py --policies egreedy-v1 egreedy-v2 --n-rounds 20`
+   using the `rule` judge and persona-level profile features (synthetic values in
+   `personas.py`). Gate: v2 mean reward ≥ v1 mean reward.
+
+3. **Promote** — if sim gate passes, change the `remotePolicy()` call in
+   `recommender.ts` from `/score/egreedy` to `/score/egreedy/v2` and change reward
+   delivery to `/reward/egreedy/v2`. No DB migration; old per-user v1 state files
+   are left on disk (available for rollback; clean up after 30 days).
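
The gate in step 2 can be sketched as a comparison of mean rewards. This assumes the sim yields a flat list of per-round rewards for each policy; `passes_promotion_gate` is a hypothetical helper, not a function in `runner.py`.

```python
from statistics import fmean


def passes_promotion_gate(v1_rewards, v2_rewards):
    """Return True when v2's mean reward is at least v1's mean reward."""
    if not v1_rewards or not v2_rewards:
        raise ValueError("need at least one reward per policy")
    return fmean(v2_rewards) >= fmean(v1_rewards)
```

A tie passes, because the gate is stated as ≥ rather than a strict win.
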
+
+### State-file migration
+
+No migration of `A`/`b` matrices from v1 → v2. A D×D→D'×D' transform would
+require assumptions about the new dimensions that we cannot justify without data.
+v2 starts from the identity prior and learns from scratch. Because live shadow
+receives no rewards, that learning happens in the offline sim. The reward
+penalty from cold-start is the correct price for the dimension extension.
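
A minimal sketch of what starting from the identity prior means at D=12. It assumes the usual ridge-regression bookkeeping for a linear bandit (`A += x xᵀ`, `b += r·x`, `theta = A⁻¹ b`), which this ADR does not spell out. The helper names are hypothetical, and the inverse is tracked directly with a Sherman-Morrison update to keep the example dependency-free.

```python
D = 12  # v2 feature dimension


def fresh_state(d=D):
    """Identity prior: A = I (stored as its inverse) and b = 0."""
    a_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    return a_inv, b


def theta(a_inv, b):
    """theta = A^-1 b."""
    return [sum(row[j] * b[j] for j in range(len(b))) for row in a_inv]


def score(a_inv, b, x):
    """Predicted reward for feature vector x."""
    return sum(t * xi for t, xi in zip(theta(a_inv, b), x))


def update(a_inv, b, x, reward):
    """Apply A += x x^T and b += reward * x, updating A^-1 in place of A."""
    n = len(x)
    ax = [sum(a_inv[i][j] * x[j] for j in range(n)) for i in range(n)]  # A^-1 x
    denom = 1.0 + sum(xi * axi for xi, axi in zip(x, ax))
    new_inv = [[a_inv[i][j] - ax[i] * ax[j] / denom for j in range(n)]
               for i in range(n)]
    new_b = [bi + reward * xi for bi, xi in zip(b, x)]
    return new_inv, new_b
```

Under these assumptions a fresh state scores every candidate 0, so the first choices are effectively ties. That is the cold-start cost described above.
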
+
+### Admin control
+
+`GET /api/admin/policies` surfaces `egreedy-v2-shadow` with `active: false`.
+Toggle via `POST /api/admin/policies/egreedy-v2-shadow/toggle`.
+
+## Consequences
+
+**Good:**
+- Profile features (preferred hour, completion/dismiss rates, volume) allow the
+  bandit to personalise timing recommendations beyond what the candidate-level
+  features encode.
+- Normalization is deterministic, bounded [0, 1], and numerically stable; no
+  scaling artefacts as the population grows.
+- Shadow-first rollout protects real users from a cold-start regression.
+
+**Trade-offs:**
+- Cold-start: v2 state files begin from the identity prior. During shadow,
+  v2's decisions for early users are effectively uninformed. This is expected
+  and intentional.
+- Synthetic persona profiles in `personas.py` approximate real user distributions;
+  the offline sim is evidence, not proof. The promotion gate requires the sim to
+  run after v2 has accumulated enough behavioral data (suggest ≥100 shadow calls
+  per policy per user before running the final sim).
+- The one-dim preferred-hour encoding loses some circular information compared to
+  two-dim sin/cos. If preferred-hour alignment becomes a dominant signal, revisit
+  with D=13 in a follow-up ADR.
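
To illustrate the information lost by the one-dim encoding: cosine is an even function, so a preferred hour three hours ahead of now and one three hours behind produce the same alignment. The two-dim variant below is one possible sin/cos encoding of the same offset, shown as an assumption. The D=13 layout itself is not specified here.

```python
import math


def alignment_1d(pref_hour, now_hour):
    """One-dim encoding from the feature table (idx 10)."""
    return (math.cos(2 * math.pi * (pref_hour - now_hour) / 24) + 1) / 2


def alignment_2d(pref_hour, now_hour):
    """Hypothetical two-dim encoding of the same offset (sin, cos)."""
    angle = 2 * math.pi * (pref_hour - now_hour) / 24
    return math.sin(angle), math.cos(angle)


ahead = alignment_1d(12, 9)   # preferred hour is 3h after now
behind = alignment_1d(6, 9)   # preferred hour is 3h before now
same_in_1d = math.isclose(ahead, behind)

sin_ahead, _ = alignment_2d(12, 9)
sin_behind, _ = alignment_2d(6, 9)
distinct_in_2d = sin_ahead > 0 > sin_behind
```

Here `same_in_1d` and `distinct_in_2d` are both true: the one-dim feature captures how far the preferred hour is from now, but not in which direction.
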
+
+## Alternatives considered
+
+**Warm-start via projection** — project v1's 7-dim theta into D=12 by padding
+with zeros. Rejected: zero initialization for the profile dims is equivalent, and
+projecting theta without the corresponding `A` matrix cannot be done correctly.
+
+**D=13 with two preferred-hour dims** — cleaner circular encoding, but contradicts
+the D=12 target in the issue spec and complicates the sim comparison. Deferred.
+
+**In-place v1 promotion without shadow** — violates ADR-0002.