oO/docs/adr/0012-egreedy-v2-profile-features.md
alvis 2d7cf217a9 feat(ml): egreedy-v2 shadow policy — D=12 with profile features (#99)
Ship the scaffolding for #99 (phase B.3 of #81):

- ml/serving: add /score/egreedy/v2, /reward/egreedy/v2, /stats/egreedy/v2
  endpoints (D=12). New feature dims: completion/dismiss rates, mean dwell
  (clipped 10min), preferred-hour alignment (cosine, 1-dim), tip volume (log).
  Separate state file per user (_egreedy_v2.json). /reset clears v2 state too.
- ADR-0012: documents D=7→12 dimension change, normalization choices, shadow
  rollout protocol, and promotion gate (offline sim win per ADR-0002).
- recommender.ts: register egreedy-v2-shadow in shadow-policy map (disabled by
  default). When enabled, calls /score/egreedy/v2 fire-and-forget and publishes
  shadow:egreedy-v2-shadow serve signal. No reward to shadow — sim is the gate.
- sim runner/personas: personas carry synthetic profile_features per persona;
  _call_score/_call_reward thread profile_features through (None-safe for v1/linucb).
- 18 new Python tests; all 56 Python + 170 TS tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:00:38 +00:00

# ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)
**Status:** Accepted
**Date:** 2026-04-25
**Issue:** #99
## Context
ADR-0011 shipped a 5-feature user-profile registry (completion rate, dismiss rate,
mean dwell, preferred hour, tip volume). `POST /score` and `POST /score/egreedy`
already receive a `profile_features` dict on every call but **ignore it** — the
comment in `ml/serving/main.py` explains why: extending the feature vector changes
`D`, which resets every user's learned `A`/`b` matrices and discards accumulated
signal. That loss requires a deliberate shadow-first rollout per ADR-0002, not an
in-place update.
This ADR authorises `egreedy-v2`, which extends the active `egreedy-v1` (D=7) with
the 5 profile features (D=12) and defines how it ships safely.
## Decision
### New policy: egreedy-v2 (D=12)
Feature vector layout:
| idx | name | encoding |
|-----|------|----------|
| 0–1 | hour_sin, hour_cos | cyclical, current hour |
| 2 | is_overdue | 0/1 |
| 3 | task_age_norm | age_days / 30, clipped to [0, 1] |
| 4 | priority_norm | (p - 1) / 3 |
| 5–6 | dow_sin, dow_cos | cyclical, day of week |
| 7 | completion_rate_30d | raw (already in [0, 1]); null → 0 |
| 8 | dismiss_rate_30d | raw (already in [0, 1]); null → 0 |
| 9 | mean_dwell_norm | dwell_ms / 600_000, clipped to [0, 1]; null → 0 |
| 10 | preferred_hour_alignment | `(cos(2π(pref - now)/24) + 1) / 2`; null → 0.5 (neutral) |
| 11 | tip_volume_norm | `log1p(n) / log1p(100)`, clipped to [0, 1]; null → 0 |
**Normalization rationale:**
- Rates are already in [0, 1]; no transform needed.
- Dwell clips at 10 min; anything beyond that carries diminishing signal.
- `preferred_hour` needs circular continuity. We use a one-dimensional
approximation: cosine alignment between the preferred hour and the current hour.
At null (no established peak) we use 0.5 (the neutral midpoint) rather than 0,
which would misleadingly read as "the current hour is polar-opposite to the peak".
- `tip_volume` uses log-scale because engagement counts are heavy-tailed.
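
The layout and normalization rules above can be sketched as a single function. This is a minimal illustration, not the code in `ml/serving/main.py`: the function name, the `profile` key names (`mean_dwell_ms`, `preferred_hour`, `tip_volume`), the `sin`/`cos` form of the cyclical encodings, the 0-6 day-of-week range, and the 1-4 priority range are all assumptions made for the example.

```python
import math
from typing import List, Mapping, Optional

TWO_PI = 2.0 * math.pi


def _clip01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def build_features_v2(
    hour: float,            # current hour, 0-23
    dow: int,               # day of week, assumed 0-6
    is_overdue: bool,
    age_days: float,
    priority: int,          # assumed to range over 1-4
    profile: Optional[Mapping[str, Optional[float]]] = None,
) -> List[float]:
    """Return the 12-dim vector in the ADR's index order (hypothetical helper)."""
    p = profile or {}  # None-safe: missing profile behaves like all-null
    completion = p.get("completion_rate_30d")
    dismiss = p.get("dismiss_rate_30d")
    dwell_ms = p.get("mean_dwell_ms")      # assumed key name
    pref_hour = p.get("preferred_hour")    # assumed key name
    tips = p.get("tip_volume")             # assumed key name

    return [
        math.sin(TWO_PI * hour / 24),                      # 0: hour_sin
        math.cos(TWO_PI * hour / 24),                      # 1: hour_cos
        1.0 if is_overdue else 0.0,                        # 2: is_overdue
        _clip01(age_days / 30),                            # 3: task_age_norm
        (priority - 1) / 3,                                # 4: priority_norm
        math.sin(TWO_PI * dow / 7),                        # 5: dow_sin
        math.cos(TWO_PI * dow / 7),                        # 6: dow_cos
        0.0 if completion is None else float(completion),  # 7: raw rate
        0.0 if dismiss is None else float(dismiss),        # 8: raw rate
        0.0 if dwell_ms is None else _clip01(dwell_ms / 600_000),  # 9
        0.5 if pref_hour is None                           # 10: neutral at null
        else (math.cos(TWO_PI * (pref_hour - hour) / 24) + 1) / 2,
        0.0 if tips is None                                # 11: log-scaled volume
        else _clip01(math.log1p(tips) / math.log1p(100)),
    ]
```

Calling it with `profile=None` fills dims 7-11 with the null defaults from the table, which is the path a v1 or LinUCB caller without profile data would take under these assumptions.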
### Rollout sequence (per ADR-0002)
1. **Shadow** (this ADR): `egreedy-v2-shadow` is registered in the recommender's
   shadow-policy map (disabled by default). An admin enables it via `/admin/policies`.
   - Calls `/score/egreedy/v2` fire-and-forget alongside the active `egreedy-v1` call.
   - Publishes `signals.tip.served` with `policy: shadow:egreedy-v2-shadow` for logging.
   - **No reward delivery to shadow**: live shadow collects decision-agreement
     exposure only; reward measurement uses offline simulation.
   - State files: `{user}_egreedy_v2.json`, isolated from v1's `{user}_egreedy.json`.
2. **Offline sim**: run `runner.py --policies egreedy-v1 egreedy-v2 --n-rounds 20`
   using the `rule` judge and persona-level profile features (synthetic values in
   `personas.py`). Gate: v2 mean reward ≥ v1 mean reward.
3. **Promote**: if the sim gate passes, change the `remotePolicy()` call in
   `recommender.ts` from `/score/egreedy` to `/score/egreedy/v2` and change reward
   delivery to `/reward/egreedy/v2`. No DB migration; old per-user v1 state files
   are left on disk (available for rollback; clean up after 30 days).
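
The promotion gate in step 2 reduces to a comparison of mean rewards. The sketch below assumes the sim results are available as one flat list of per-round rewards per policy; how `runner.py` actually reports results is not specified here, and `passes_promotion_gate` is a hypothetical name.

```python
from statistics import fmean
from typing import Sequence


def passes_promotion_gate(
    v1_rewards: Sequence[float],
    v2_rewards: Sequence[float],
) -> bool:
    """Gate from the rollout sequence: v2 mean reward >= v1 mean reward."""
    if not v1_rewards or not v2_rewards:
        # An empty run is not evidence either way; refuse to decide.
        raise ValueError("both policies need at least one recorded reward")
    return fmean(v2_rewards) >= fmean(v1_rewards)
```

Note that the gate is non-strict: a tie passes. Nothing here accounts for variance across rounds, so a narrow pass over 20 rounds is weak evidence.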
### State-file migration
No migration of `A`/`b` matrices from v1 → v2. A D×D→D'×D' transform would
require assumptions about the new dimensions that we cannot justify without data.
v2 starts from the identity prior and learns from scratch in shadow/sim. The reward
penalty from cold-start is the correct price for the dimension extension.
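
For illustration, a fresh v2 state under the identity prior might look like the sketch below. The JSON layout (`d`, `A`, `b` keys) and the helper names are assumptions; only the `{user}_egreedy_v2.json` naming and the identity prior come from this ADR.

```python
import json
from typing import Dict, List, Union

State = Dict[str, Union[int, List[float], List[List[float]]]]


def fresh_state(d: int = 12) -> State:
    """Identity prior: A is the d x d identity matrix, b is the zero vector."""
    A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    return {"d": d, "A": A, "b": b}  # key names are hypothetical


def state_filename(user_id: str) -> str:
    """Per-user v2 state file, kept separate from v1's {user}_egreedy.json."""
    return f"{user_id}_egreedy_v2.json"


state = fresh_state()
payload = json.dumps(state)  # what would be written to state_filename(user)
```

With `A` equal to the identity and `b` all zeros, the implied weight vector is zero, so every candidate scores the same until rewards arrive. That is the cold-start cost this section accepts.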
### Admin control
`GET /api/admin/policies` surfaces `egreedy-v2-shadow` with `active: false`.
Toggle via `POST /api/admin/policies/egreedy-v2-shadow/toggle`.
## Consequences
**Good:**
- Profile features (preferred hour, completion/dismiss rates, volume) allow the
bandit to personalise timing recommendations beyond what the candidate-level
features encode.
- Normalization is deterministic, bounded [0, 1], and numerically stable; no
scaling artefacts as the population grows.
- Shadow-first rollout protects real users from a cold-start regression.
**Trade-offs:**
- Cold-start: v2 state files begin from the identity prior. During shadow,
v2 makes random-ish decisions for early users. This is expected and intentional.
- Synthetic persona profiles in `personas.py` approximate real user distributions;
the offline sim is evidence, not proof. The promotion gate requires the sim to
run after v2 has accumulated enough behavioral data (we suggest at least 100
shadow calls per policy per user before running the final sim).
- The one-dim preferred-hour encoding loses some circular information compared to
two-dim sin/cos. If preferred-hour alignment becomes a dominant signal, revisit
with D=13 in a follow-up ADR.
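
The information lost by the one-dim encoding can be seen directly: cosine is an even function, so being 3 hours before the preferred hour and 3 hours after it produce the same feature value. A small sketch using the formula from the feature table (the function name is illustrative):

```python
import math


def preferred_hour_alignment(pref: float, now: float) -> float:
    """One-dim alignment from the feature table: (cos(2*pi*(pref - now)/24) + 1) / 2."""
    return (math.cos(2 * math.pi * (pref - now) / 24) + 1) / 2


before = preferred_hour_alignment(pref=12, now=9)    # 3 h before the peak
after = preferred_hour_alignment(pref=12, now=15)    # 3 h after the peak
opposite = preferred_hour_alignment(pref=12, now=0)  # 12 h away from the peak

# The encoding cannot tell "before" from "after"; only distance to the peak survives.
same_value = math.isclose(before, after)
```

A two-dim sin/cos encoding of the offset would keep the sign, which is what the D=13 alternative would buy.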
## Alternatives considered
**Warm-start via projection** — project v1's 7-dim theta into D=12 by padding
with zeros. Rejected: zero initialization for the profile dims is equivalent, and
projecting theta without the corresponding `A` matrix cannot be done correctly.
**D=13 with two preferred-hour dims** — cleaner circular encoding, but contradicts
the D=12 target in the issue spec and complicates the sim comparison. Deferred.
**In-place v1 promotion without shadow** — violates ADR-0002.