research: next-gen ranking policies — Thompson sampling, neural bandits, hybrid #83

alvis · 2026-04-16T15:26:11Z

Motivation

ε-greedy v1 is a strong baseline but has known limitations:

Fixed ε=0.10 (no decay, no adaptive exploration)
Linear reward model (ridge regression on d=7 features)
No cross-user learning (cold-start problem for new users)

Research directions

1. Thompson Sampling

Posterior sampling instead of ε-greedy exploration
Exploration scales with posterior uncertainty instead of a fixed rate, which typically gives a better exploration/exploitation tradeoff
Can use same linear model (Bayesian linear regression)
Benchmark: offline sim against ε-greedy v1
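
As a starting point for discussion, here is a minimal sketch of linear Thompson sampling on top of Bayesian linear regression. It is not the `ml/serving` implementation. The class name, the `prior_precision` and `noise_std` parameters, and the single shared weight vector are all assumptions made for illustration. It uses only the standard library, so it runs without NumPy.

```python
import math
import random


class LinearThompsonSampler:
    """Bayesian linear regression with a Gaussian prior (hypothetical sketch)."""

    def __init__(self, dim, prior_precision=1.0, noise_std=1.0, seed=None):
        self.dim = dim
        self.noise_std = noise_std
        # Precision matrix A = prior_precision * I + sum(x x^T); b = sum(reward * x).
        self.A = [[prior_precision if i == j else 0.0 for j in range(dim)]
                  for i in range(dim)]
        self.b = [0.0] * dim
        self.rng = random.Random(seed)

    def _cholesky(self):
        """Lower-triangular L with L L^T = A (A is symmetric positive definite)."""
        d = self.dim
        L = [[0.0] * d for _ in range(d)]
        for i in range(d):
            for j in range(i + 1):
                s = self.A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
        return L

    @staticmethod
    def _solve_lower(L, y):
        """Solve L v = y by forward substitution."""
        v = []
        for i, row in enumerate(L):
            v.append((y[i] - sum(row[k] * v[k] for k in range(i))) / row[i])
        return v

    @staticmethod
    def _solve_upper_t(L, y):
        """Solve L^T v = y by back substitution."""
        d = len(L)
        v = [0.0] * d
        for i in reversed(range(d)):
            s = sum(L[k][i] * v[k] for k in range(i + 1, d))
            v[i] = (y[i] - s) / L[i][i]
        return v

    def posterior_mean(self):
        """Posterior mean A^-1 b (the ridge solution)."""
        L = self._cholesky()
        return self._solve_upper_t(L, self._solve_lower(L, self.b))

    def sample_weights(self):
        """Draw theta ~ N(A^-1 b, noise_std^2 * A^-1)."""
        L = self._cholesky()
        mean = self._solve_upper_t(L, self._solve_lower(L, self.b))
        z = [self.rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        noise = self._solve_upper_t(L, z)  # L^-T z has covariance A^-1
        return [m + self.noise_std * n for m, n in zip(mean, noise)]

    def select(self, candidates):
        """Return the index of the candidate with the highest sampled score."""
        theta = self.sample_weights()
        scores = [sum(t * x for t, x in zip(theta, c)) for c in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)

    def update(self, x, reward):
        """Fold one (features, reward) observation into the posterior."""
        for i in range(self.dim):
            self.b[i] += reward * x[i]
            for j in range(self.dim):
                self.A[i][j] += x[i] * x[j]
```

The posterior mean here equals the ridge estimate, so the existing linear model could in principle be reused; the only new piece is drawing the weights from the posterior before scoring. With `dim=7` the Cholesky factorization is cheap enough to recompute on every request.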

2. Neural Contextual Bandits

Replace linear ridge with a small NN (e.g. 2-layer MLP)
Captures non-linear feature interactions
Per-user fine-tuning from global model (transfer learning)
Risk: overfits with few observations per user
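
To make the "small NN" concrete, a rough sketch of a two-layer MLP reward model trained by per-sample SGD follows. It covers only the reward model; the exploration strategy wrapped around it is left open. `TinyMLP`, the tanh activation, the hidden width of 8, and the learning rate are assumptions for illustration, not decisions taken in this issue.

```python
import math
import random


class TinyMLP:
    """Two-layer MLP (tanh hidden layer, linear output); hypothetical sketch."""

    def __init__(self, n_in, n_hidden=8, lr=0.01, seed=0):
        rng = random.Random(seed)
        self.lr = lr
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x):
        """Predicted reward for feature vector x."""
        h = self._hidden(x)
        return sum(w * hj for w, hj in zip(self.w2, h)) + self.b2

    def update(self, x, reward):
        """One SGD step on the squared error 0.5 * (predict(x) - reward)^2."""
        h = self._hidden(x)
        err = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2 - reward
        for j, hj in enumerate(h):
            # Backpropagate through tanh using the pre-update output weight.
            grad_pre = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= self.lr * err * hj
            self.b1[j] -= self.lr * grad_pre
            for i, xi in enumerate(x):
                self.w1[j][i] -= self.lr * grad_pre * xi
        self.b2 -= self.lr * err
        return err
```

Per-user fine-tuning would then amount to copying the global weights and calling `update` on one user's observations, which is where the overfitting risk noted above comes in: with only a handful of samples, even this small network has far more parameters than data points.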

3. Hybrid: global model → per-user adaptation

Train a global model on all users (cold-start solution)
Fine-tune per-user as data accumulates
Smooth transition from population prior to personalized posterior
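
One simple way to get the smooth transition is shrinkage: blend the global and per-user weights with a factor that grows with the user's observation count. The sketch below assumes both models are linear weight vectors of equal length. `blend_weights` and the `prior_strength` default are hypothetical and would need tuning in the offline sim.

```python
def blend_weights(global_w, user_w, n_user_obs, prior_strength=20.0):
    """Shrink per-user weights toward the global model.

    alpha = n / (n + k) starts at 0 (new user, pure global model) and
    approaches 1 as observations accumulate (mostly personalized).
    """
    if n_user_obs < 0:
        raise ValueError("n_user_obs must be non-negative")
    alpha = n_user_obs / (n_user_obs + prior_strength)
    return [(1.0 - alpha) * g + alpha * u for g, u in zip(global_w, user_w)]
```

With `prior_strength=20`, a user with 20 observations gets an even split between the two models. A fully Bayesian alternative would use the global model as the prior mean of the per-user regression, which produces a similar effect without an explicit blending step.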

4. Explore-then-commit

Pure exploration for first N tips, then exploit
Simple baseline for new users
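
A non-contextual sketch of this baseline, assuming a fixed set of arms (for example tip categories) and round-robin exploration; uniform random exploration would work equally well. The class and parameter names are hypothetical.

```python
class ExploreThenCommit:
    """Round-robin over arms for `explore_rounds` pulls, then commit to the best mean."""

    def __init__(self, n_arms, explore_rounds):
        self.n_arms = n_arms
        self.explore_rounds = explore_rounds
        self.rounds = 0
        self.counts = [0] * n_arms
        self.totals = [0.0] * n_arms

    def select(self):
        """Pick an arm; assumes update() is called once after every select()."""
        if self.rounds < self.explore_rounds:
            return self.rounds % self.n_arms
        means = [t / c if c else float("-inf")
                 for t, c in zip(self.totals, self.counts)]
        return max(range(self.n_arms), key=means.__getitem__)

    def update(self, arm, reward):
        """Record rewards only during exploration; statistics freeze after commit."""
        if self.rounds < self.explore_rounds:
            self.counts[arm] += 1
            self.totals[arm] += reward
        self.rounds += 1
```

Freezing the statistics after the exploration phase is what distinguishes this from a greedy policy with a warm-up: once committed, the choice never changes, so picking N too small risks locking in a poor arm.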

5. Reward model research

Learn personalized reward functions (not hardcoded dwell brackets)
Implicit signals: session duration, return frequency, time-of-day patterns
Multi-objective: helpfulness × timing × variety
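
Reading "helpfulness × timing × variety" literally as a product, a multiplicative reward could look like the sketch below. It assumes each component has already been scaled to [0, 1]; how the components are measured is not specified here, and `combined_reward` is a hypothetical name.

```python
def combined_reward(helpfulness, timing, variety):
    """Multiplicative scalarization of three component scores in [0, 1]."""
    components = (("helpfulness", helpfulness),
                  ("timing", timing),
                  ("variety", variety))
    for name, value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return helpfulness * timing * variety
```

The product form means a zero in any one component zeroes the whole reward, so a helpful tip shown at a bad time earns nothing. A weighted sum would be more forgiving; which behaviour is wanted is part of the research question.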

Methodology

All experiments via offline sim framework (ml/experiments/sim/)
Compare on: mean reward, regret, cold-start performance, diversity
Promote winner via shadow → ADR → active (same process as ADR-0007)
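
For the comparison metrics, here is a sketch of how regret and diversity might be computed from a sim run. It assumes the simulator exposes the best achievable expected reward per round, which is only possible offline. Measuring diversity as normalized entropy of the shown items is one possible definition, not one fixed by this issue. Function names are hypothetical.

```python
import math
from collections import Counter


def cumulative_regret(optimal_rewards, obtained_rewards):
    """Running sum of (best achievable reward - obtained reward) per round."""
    if len(optimal_rewards) != len(obtained_rewards):
        raise ValueError("reward sequences must have the same length")
    total, curve = 0.0, []
    for best, got in zip(optimal_rewards, obtained_rewards):
        total += best - got
        curve.append(total)
    return curve


def normalized_entropy(chosen_items, n_items):
    """Shannon entropy of the choice distribution, scaled to [0, 1]."""
    if n_items < 2 or not chosen_items:
        return 0.0
    n = len(chosen_items)
    counts = Counter(chosen_items)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n_items)
```

Cold-start performance can reuse the same functions by restricting the input to each simulated user's first few rounds.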

Tasks

Implement Thompson sampling policy in ml/serving
Implement neural bandit (small MLP) policy
Implement global-then-personalize transfer policy
Extended offline sim: cold-start scenarios, diverse personas
Benchmark all policies against ε-greedy v1
ADR for next promoted policy

alvis added this to the M2 — AI tips + multi-source signals milestone 2026-04-16 15:26:11 +00:00

alvis closed this issue

2026-05-14 10:44:29 +00:00

Reference: alvis/oO#83