oO/docs/adr/0012-egreedy-v2-profile-features.md
alvis 7281af83a4 feat(bandit): promote egreedy-v2 (D=12, profile features) as active policy (#99)
Offline sim gate passed — egreedy-v2 mean reward −0.629 vs egreedy-v1 −0.642
(5 users × 20 rounds, rule judge, seed 42). v2 wins 3/5 personas.

- recommender.ts: switch remotePolicy() to /score/egreedy/v2
- recommender.ts: switch sendRewardWithRetry() to /reward/egreedy/v2 with
  profile_features payload so the ridge update uses the full D=12 vector
- recommender.ts: re-fetch profile at feedback time (TTL-cached, near-instant)
- ADR-0012: status Accepted → Promoted, promotion record appended

Shadow entry egreedy-v2-shadow kept in registry (active: false) for rollback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 03:08:28 +00:00

# ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)
**Status:** Promoted
**Date:** 2026-04-25 (accepted) / 2026-04-26 (promoted)
**Issue:** #99
## Context
ADR-0011 shipped a 5-feature user-profile registry (completion rate, dismiss rate,
mean dwell, preferred hour, tip volume). `POST /score` and `POST /score/egreedy`
already receive a `profile_features` dict on every call but **ignore it** — the
comment in `ml/serving/main.py` explains why: extending the feature vector changes
`D`, which resets every user's learned `A`/`b` matrices and discards accumulated
signal. That loss requires a deliberate shadow-first rollout per ADR-0002, not an
in-place update.
This ADR authorises `egreedy-v2`, which extends the active `egreedy-v1` (D=7) with
the 5 profile features (D=12) and defines how it ships safely.
## Decision
### New policy: egreedy-v2 (D=12)
Feature vector layout:
| idx | name | encoding |
|-----|------|----------|
| 0–1 | hour_sin, hour_cos | cyclical, current hour |
| 2 | is_overdue | 0/1 |
| 3 | task_age_norm | age_days / 30, clipped 0–1 |
| 4 | priority_norm | (p − 1) / 3 |
| 5–6 | dow_sin, dow_cos | cyclical, day of week |
| 7 | completion_rate_30d | raw (already 0–1); null → 0 |
| 8 | dismiss_rate_30d | raw (already 0–1); null → 0 |
| 9 | mean_dwell_norm | dwell_ms / 600_000, clipped 0–1; null → 0 |
| 10 | preferred_hour_alignment | `(cos(2π(pref − now)/24) + 1) / 2`; null → 0.5 (neutral) |
| 11 | tip_volume_norm | `log1p(n) / log1p(100)`, clipped 0–1; null → 0 |
**Normalization rationale:**
- Rates are already in [0, 1]; no transform needed.
- Dwell clips at 10 min — anything beyond that carries diminishing signal.
- `preferred_hour` needs circular continuity, so it is approximated in one
dimension by its cosine alignment with the current hour. When it is null (no
established peak) we use 0.5 (the neutral midpoint) rather than 0, which would
misleadingly signal a polar-opposite hour.
- `tip_volume` uses log-scale because engagement counts are heavy-tailed.
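
The layout and normalization rules above can be sketched as a small encoder. This is an illustrative sketch, not the code in `ml/serving/main.py`: the function names, the profile dictionary keys (`mean_dwell_ms`, `preferred_hour`, `tip_volume`), the 24-hour and 7-day periods for the cyclical terms, and the assumption that priority lies in 1..4 are all hypothetical.

```python
import math


def clip01(x: float) -> float:
    """Clamp x into [0, 1]."""
    return max(0.0, min(1.0, x))


def candidate_dims(hour: float, dow: int, is_overdue: bool,
                   age_days: float, priority: int) -> list[float]:
    """Indices 0-6 (candidate-level features). Periods are assumed."""
    hour_angle = 2 * math.pi * hour / 24
    dow_angle = 2 * math.pi * dow / 7
    return [
        math.sin(hour_angle), math.cos(hour_angle),
        1.0 if is_overdue else 0.0,
        clip01(age_days / 30),
        (priority - 1) / 3,  # assumes priority is in 1..4
        math.sin(dow_angle), math.cos(dow_angle),
    ]


def profile_dims(profile: dict, now_hour: float) -> list[float]:
    """Indices 7-11 (profile features), with the null defaults from the table."""
    completion = profile.get("completion_rate_30d")
    dismiss = profile.get("dismiss_rate_30d")
    dwell_ms = profile.get("mean_dwell_ms")    # hypothetical key
    pref = profile.get("preferred_hour")       # hypothetical key
    tips = profile.get("tip_volume")           # hypothetical key
    return [
        0.0 if completion is None else completion,
        0.0 if dismiss is None else dismiss,
        0.0 if dwell_ms is None else clip01(dwell_ms / 600_000),
        0.5 if pref is None
        else (math.cos(2 * math.pi * (pref - now_hour) / 24) + 1) / 2,
        0.0 if tips is None else clip01(math.log1p(tips) / math.log1p(100)),
    ]


def encode_v2(hour: float, dow: int, is_overdue: bool, age_days: float,
              priority: int, profile: dict) -> list[float]:
    """Concatenate the 7 candidate dims and 5 profile dims into D=12."""
    return (candidate_dims(hour, dow, is_overdue, age_days, priority)
            + profile_dims(profile, hour))
```

With an empty profile the last five entries fall back to the null defaults, so a user with no history still gets a well-formed 12-dimensional vector.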
### Rollout sequence (per ADR-0002)
1. **Shadow** (this ADR): `egreedy-v2-shadow` is registered in the recommender's
   shadow-policy map (disabled by default). An admin enables it via `/admin/policies`.
   - Calls `/score/egreedy/v2` fire-and-forget alongside the active `egreedy-v1` call.
   - Publishes `signals.tip.served` with `policy: shadow:egreedy-v2-shadow` for logging.
   - **No reward delivery to shadow**: the live shadow collects decision-agreement
     exposure only; reward measurement uses offline simulation.
   - State files: `{user}_egreedy_v2.json`, isolated from v1's `{user}_egreedy.json`.
2. **Offline sim**: run `runner.py --policies egreedy-v1 egreedy-v2 --n-rounds 20`
   using the `rule` judge and persona-level profile features (synthetic values in
   `personas.py`). Gate: v2 mean reward ≥ v1 mean reward.
3. **Promote**: if the sim gate passes, change the `remotePolicy()` call in
   `recommender.ts` from `/score/egreedy` to `/score/egreedy/v2` and change reward
   delivery to `/reward/egreedy/v2`. No DB migration; old per-user v1 state files
   are left on disk (available for rollback; clean up after 30 days).
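
Step 1 says the live shadow collects decision-agreement exposure only. One plausible way to summarise that exposure is the fraction of calls where the shadow policy picked the same candidate as the active policy. The sketch below assumes the logged decisions are available as `(active_choice, shadow_choice)` pairs; that shape and the function name are hypothetical, since the ADR does not define the log format.

```python
def agreement_rate(decisions: list[tuple[str, str]]) -> float:
    """Fraction of calls where active and shadow chose the same candidate.

    `decisions` holds (active_choice, shadow_choice) pairs; returns 0.0
    when nothing has been logged yet.
    """
    if not decisions:
        return 0.0
    matches = sum(1 for active, shadow in decisions if active == shadow)
    return matches / len(decisions)
```

A low rate early in shadow is consistent with the cold start described in this ADR and is not, on its own, a signal about reward.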
### State-file migration
No migration of `A`/`b` matrices from v1 → v2. A D×D→D'×D' transform would
require assumptions about the new dimensions that we cannot justify without data.
v2 starts from the identity prior and learns from scratch in shadow/sim. The reward
penalty from cold-start is the correct price for the dimension extension.
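
To make the dimension constraint concrete, here is a minimal sketch of per-user state under an identity prior. It assumes the standard ridge-regression bandit update (`A += x xᵀ`, `b += reward · x`); the ADR names the `A`/`b` matrices and the ridge update but does not spell out the rule, so treat the update and all function names as assumptions.

```python
def new_state(d: int) -> dict:
    """Identity prior: A = I (d x d), b = 0 (length d)."""
    return {
        "d": d,
        "A": [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)],
        "b": [0.0] * d,
    }


def ridge_update(state: dict, x: list[float], reward: float) -> None:
    """Rank-one update in place; rejects vectors of the wrong dimension."""
    d = state["d"]
    if len(x) != d:
        raise ValueError(f"feature vector has {len(x)} dims, state expects {d}")
    for i in range(d):
        state["b"][i] += reward * x[i]
        for j in range(d):
            state["A"][i][j] += x[i] * x[j]


def state_filename(user: str, version: str) -> str:
    """v2 state lives in its own file so v1 state is never overwritten."""
    if version == "v2":
        return f"{user}_egreedy_v2.json"
    return f"{user}_egreedy.json"
```

A D=7 state cannot absorb a D=12 vector, which is why v2 gets fresh state in a separate file instead of reusing v1's.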
### Admin control
`GET /api/admin/policies` surfaces `egreedy-v2-shadow` with `active: false`.
Toggle via `POST /api/admin/policies/egreedy-v2-shadow/toggle`.
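
As a rough illustration, the toggle call could be prepared from a script like this. The base URL is a placeholder, and the sketch assumes the endpoint needs no request body; authentication is not covered because the ADR does not describe it.

```python
import urllib.request


def build_toggle_request(base_url: str,
                         policy: str = "egreedy-v2-shadow") -> urllib.request.Request:
    """Build (but do not send) the POST that toggles a shadow policy."""
    url = f"{base_url.rstrip('/')}/api/admin/policies/{policy}/toggle"
    return urllib.request.Request(url, method="POST")
```

Passing the returned request to `urllib.request.urlopen` would send it.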
## Consequences
**Good:**
- Profile features (preferred hour, completion/dismiss rates, volume) allow the
bandit to personalise timing recommendations beyond what the candidate-level
features encode.
- Normalization is deterministic, bounded [0, 1], and numerically stable; no
scaling artefacts as the population grows.
- Shadow-first rollout protects real users from a cold-start regression.
**Trade-offs:**
- Cold-start: v2 state files begin from the identity prior. During shadow,
v2 makes random-ish decisions for early users. This is expected and intentional.
- Synthetic persona profiles in `personas.py` approximate real user distributions;
the offline sim is evidence, not proof. The promotion gate requires the sim to
run after v2 has accumulated enough behavioral data (suggested: at least 100
shadow calls per policy per user before running the final sim).
- The one-dim preferred-hour encoding loses some circular information compared to
two-dim sin/cos. If preferred-hour alignment becomes a dominant signal, revisit
with D=13 in a follow-up ADR.
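
The information lost by the one-dim encoding can be shown directly: cosine alignment is symmetric, so a preferred hour three hours before the current hour scores the same as one three hours after. The sketch below compares it with a two-dim sin/cos pair of the same angle. That pair is only one possible form of the D=13 encoding, which this ADR does not specify.

```python
import math


def alignment(pref: float, now: float) -> float:
    """One-dim encoding from the feature table, in [0, 1]."""
    return (math.cos(2 * math.pi * (pref - now) / 24) + 1) / 2


def two_dim(pref: float, now: float) -> tuple[float, float]:
    """Hypothetical two-dim alternative: keeps the sign of the offset."""
    angle = 2 * math.pi * (pref - now) / 24
    return (math.sin(angle), math.cos(angle))


earlier = alignment(9, 12)   # preferred hour 3 h before now
later = alignment(15, 12)    # preferred hour 3 h after now
```

Here `earlier` and `later` are equal, while `two_dim(9, 12)` and `two_dim(15, 12)` differ in the sign of their first component.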
## Alternatives considered
**Warm-start via projection** — project v1's 7-dim theta into D=12 by padding
with zeros. Rejected: zero-padding leaves the profile dims at the same starting
point as a cold start, and theta cannot be projected correctly without the
corresponding `A` matrix.
**D=13 with two preferred-hour dims** — cleaner circular encoding, but contradicts
the D=12 target in the issue spec and complicates the sim comparison. Deferred.
**In-place v1 promotion without shadow** — violates ADR-0002.
## Promotion record (2026-04-26)
Offline sim (`runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42`):
| policy | total reward | mean reward | pulls |
|--------|-------------|-------------|-------|
| egreedy-v1 | −64.20 | −0.6420 | 100 |
| egreedy-v2 | −62.90 | −0.6290 | 100 |
**Gate passed** (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.
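
The gate arithmetic can be checked in a few lines. Rewards in this sim are negative (the promotion commit reports a mean of −0.629 for v2 against −0.642 for v1), so "greater or equal" means "less negative". The function names are illustrative and not taken from `runner.py`.

```python
def mean_reward(total_reward: float, pulls: int) -> float:
    """Mean reward per pull."""
    return total_reward / pulls


def gate_passes(v1_mean: float, v2_mean: float) -> bool:
    """Promotion gate from the rollout sequence: v2 mean >= v1 mean."""
    return v2_mean >= v1_mean


# 5 users x 20 rounds = 100 pulls per policy.
v1_mean = mean_reward(-64.20, 100)
v2_mean = mean_reward(-62.90, 100)
passed = gate_passes(v1_mean, v2_mean)
```

The margin is about 0.013 reward per pull on 100 pulls with a single seed, which fits the ADR's framing of the sim as evidence, not proof.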
Changes applied:
- `recommender.ts` `remotePolicy()`: `/score/egreedy` → `/score/egreedy/v2`
- `recommender.ts` `sendRewardWithRetry()`: `/reward/egreedy` → `/reward/egreedy/v2`, added `profile_features` to payload
- Shadow entry `egreedy-v2-shadow` left in registry (`active: false`) for rollback.