research: next-gen ranking policies — Thompson sampling, neural bandits, hybrid #83

Closed
opened 2026-04-16 15:26:11 +00:00 by alvis · 0 comments
Owner

Motivation

ε-greedy v1 is a strong baseline but has known limitations:

  • Fixed ε=0.10 (no decay, no adaptive exploration)
  • Linear reward model (ridge regression on d=7 features)
  • No cross-user learning (cold-start problem for new users)

Research directions

1. Thompson Sampling

  • Posterior sampling instead of ε-greedy exploration
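  • Better exploration/exploitation tradeoff: exploration is driven by posterior uncertainty rather than a fixed random rate
  • Can reuse the same linear model: ridge regression is the posterior mean of Bayesian linear regression with a Gaussian prior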
  • Benchmark: offline sim against ε-greedy v1
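
A minimal sketch of linear Thompson sampling, assuming a Gaussian prior on the weights and a known noise scale. The class name `LinearThompson` and the parameters `ridge` and `noise_std` are hypothetical and do not come from ml/serving. It is written in pure Python so it runs without dependencies; a real policy would likely use a linear-algebra library.

```python
import math
import random


def cholesky(a):
    """Return lower-triangular L with L @ L.T == a (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y


def solve_upper_t(L, y):
    """Back substitution: solve L.T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


class LinearThompson:
    """Bayesian linear regression with posterior sampling (hypothetical sketch)."""

    def __init__(self, d, ridge=1.0, noise_std=1.0, seed=0):
        self.d = d
        self.noise_std = noise_std
        # A = ridge * I + sum(x x^T), b = sum(reward * x)
        self.A = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d
        self.rng = random.Random(seed)

    def posterior_mean(self):
        L = cholesky(self.A)
        return solve_upper_t(L, solve_lower(L, self.b))

    def sample_theta(self):
        # theta ~ N(A^-1 b, noise_std^2 * A^-1), using A = L L^T
        L = cholesky(self.A)
        mean = solve_upper_t(L, solve_lower(L, self.b))
        z = [self.rng.gauss(0.0, 1.0) for _ in range(self.d)]
        noise = solve_upper_t(L, z)  # L^-T z has covariance A^-1
        return [m + self.noise_std * e for m, e in zip(mean, noise)]

    def choose(self, candidates):
        """Score each candidate feature vector with one posterior draw."""
        theta = self.sample_theta()
        scores = [sum(t * x for t, x in zip(theta, feats)) for feats in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)

    def update(self, x, reward):
        for i in range(self.d):
            self.b[i] += reward * x[i]
            for j in range(self.d):
                self.A[i][j] += x[i] * x[j]
```

With `noise_std=0.0` the draw collapses to the ridge estimate, which gives a convenient way to check the sampler against the existing linear model in the offline sim.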

2. Neural Contextual Bandits

  • Replace linear ridge with a small NN (e.g. 2-layer MLP)
  • Captures non-linear feature interactions
  • Per-user fine-tuning from global model (transfer learning)
  • Risk: overfits with few observations per user
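
A rough sketch of the small reward model, reading "2-layer MLP" as one tanh hidden layer plus a linear output, trained by SGD on squared error. The class name `TinyMLP`, the hidden width, and the learning rate are assumptions for illustration. Exploration is deliberately left out; the model only replaces the reward estimate.

```python
import math
import random


class TinyMLP:
    """One-hidden-layer reward model trained online (hypothetical sketch)."""

    def __init__(self, d, hidden=8, lr=0.02, seed=0):
        rng = random.Random(seed)
        self.lr = lr
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.W1, self.b1)
        ]

    def predict(self, x):
        h = self._hidden(x)
        return sum(w * hj for w, hj in zip(self.w2, h)) + self.b2

    def update(self, x, reward):
        """One SGD step on 0.5 * (prediction - reward)^2; returns that loss."""
        h = self._hidden(x)
        err = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2 - reward
        for j, hj in enumerate(h):
            # Gradient w.r.t. the hidden pre-activation, using w2 before its update.
            grad_pre = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= self.lr * err * hj
            self.b1[j] -= self.lr * grad_pre
            for i, xi in enumerate(x):
                self.W1[j][i] -= self.lr * grad_pre * xi
        self.b2 -= self.lr * err
        return 0.5 * err * err
```

Repeated updates on a single example drive its error toward zero, which is exactly the overfitting risk noted above when a user has only a handful of observations.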

3. Hybrid: global model → per-user adaptation

  • Train a global model on all users (cold-start solution)
  • Fine-tune per-user as data accumulates
  • Smooth transition from population prior to personalized posterior
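
One simple way to get the smooth transition is a shrinkage blend, sketched here under assumptions: the function names and `prior_strength` are hypothetical. The weight has the same form as a Gaussian posterior mean in which the population prior counts as `prior_strength` pseudo-observations.

```python
def personal_weight(n_obs, prior_strength=20.0):
    """Weight on the per-user model; rises from 0 toward 1 as n_obs grows."""
    return n_obs / (n_obs + prior_strength)


def blended_score(global_score, personal_score, n_obs, prior_strength=20.0):
    """Interpolate between the global and the per-user score for one candidate."""
    w = personal_weight(n_obs, prior_strength)
    return (1.0 - w) * global_score + w * personal_score
```

A new user (`n_obs=0`) is scored purely by the global model, and the two models carry equal weight once `n_obs` equals `prior_strength`.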

4. Explore-then-commit

  • Pure exploration for first N tips, then exploit
  • Simple baseline for new users
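
A minimal sketch of explore-then-commit for a fixed set of arms, assuming uniform random exploration. The class name `ExploreThenCommit` and the parameter `explore_rounds` (the N above) are hypothetical. Statistics are frozen after the exploration phase so the commitment does not drift.

```python
import random


class ExploreThenCommit:
    """Explore uniformly for explore_rounds updates, then commit to the best arm."""

    def __init__(self, n_arms, explore_rounds, seed=0):
        self.n_arms = n_arms
        self.explore_rounds = explore_rounds
        self.counts = [0] * n_arms
        self.totals = [0.0] * n_arms
        self.rounds = 0
        self.rng = random.Random(seed)

    def choose(self):
        if self.rounds < self.explore_rounds:
            return self.rng.randrange(self.n_arms)
        # Arms never tried during exploration rank last.
        means = [t / c if c else float("-inf") for t, c in zip(self.totals, self.counts)]
        return max(range(self.n_arms), key=means.__getitem__)

    def update(self, arm, reward):
        # Only exploration-phase rewards feed the estimate.
        if self.rounds < self.explore_rounds:
            self.counts[arm] += 1
            self.totals[arm] += reward
        self.rounds += 1
```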

5. Reward model research

  • Learn personalized reward functions (not hardcoded dwell brackets)
  • Implicit signals: session duration, return frequency, time-of-day patterns
  • Multi-objective: helpfulness × timing × variety
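
One literal reading of "helpfulness × timing × variety" is a multiplicative scalarization, sketched below. This is an assumption about the intended combination, and the function name and the [0, 1] scaling of each component are hypothetical.

```python
def combined_reward(helpfulness, timing, variety):
    """Product of three component scores, each expected in [0, 1]."""
    for name, value in (
        ("helpfulness", helpfulness),
        ("timing", timing),
        ("variety", variety),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return helpfulness * timing * variety
```

A product means a tip scoring zero on any one objective earns no reward, unlike a weighted sum where a strong component can compensate for a weak one.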

Methodology

  • All experiments via offline sim framework (ml/experiments/sim/)
  • Compare on: mean reward, regret, cold-start performance, diversity
  • Promote winner via shadow → ADR → active (same process as ADR-0007)
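
The four comparison metrics could be computed from a simulation log roughly as follows. The log schema (`user`, `item`, `reward`, `best_reward`), the function name `summarize`, and the use of item entropy as the diversity measure are all assumptions; the actual format used by ml/experiments/sim/ may differ.

```python
import math
from collections import Counter, defaultdict


def summarize(log, cold_start_k=5):
    """Summarize a time-ordered list of interaction dicts (hypothetical schema)."""
    if not log:
        raise ValueError("log is empty")
    n = len(log)
    mean_reward = sum(e["reward"] for e in log) / n
    # Regret against the best achievable reward the simulator reports per step.
    regret = sum(e["best_reward"] - e["reward"] for e in log)
    # Cold-start performance: mean reward over each user's first k interactions.
    seen = defaultdict(int)
    cold = []
    for e in log:
        if seen[e["user"]] < cold_start_k:
            cold.append(e["reward"])
        seen[e["user"]] += 1
    # Diversity: Shannon entropy (in nats) of the distribution of served items.
    counts = Counter(e["item"] for e in log)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return {
        "mean_reward": mean_reward,
        "cumulative_regret": regret,
        "cold_start_reward": sum(cold) / len(cold),
        "item_entropy": entropy,
    }
```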

Tasks

  • [ ] Implement Thompson sampling policy in ml/serving
  • [ ] Implement neural bandit (small MLP) policy
  • [ ] Implement global-then-personalize transfer policy
  • [ ] Extended offline sim: cold-start scenarios, diverse personas
  • [ ] Benchmark all policies against ε-greedy v1
  • [ ] ADR for next promoted policy
alvis added this to the M2 — AI tips + multi-source signals milestone 2026-04-16 15:26:11 +00:00
alvis closed this issue 2026-05-14 10:44:29 +00:00

Reference: alvis/oO#83