ADR-0013 — Multi-agent recommendation: pre-computed agent snippets + orchestrator LLM
Status: Accepted
Date: 2026-05-01
Supersedes: ADR-0007, ADR-0012
Context
The ε-greedy bandit (ADR-0007, promoted to v2 in ADR-0012) was the first recommendation policy. It served adequately during early M1 testing but carries structural problems that become more acute as the user base grows:
- Training signal sparsity. The median user generates fewer than 5 reward signals per week. Ridge regression on a 12-dimensional feature vector needs far more signal than that to converge to a meaningful θ before the user loses interest.
- Cold-start cost. Every new user starts with an uninformed identity matrix. Early tips are essentially random for the first weeks of use — precisely when first impressions matter most.
- Opacity. The bandit cannot explain why it chose a tip. An orchestrator that reasons explicitly over named agent outputs ("3 overdue tasks + peak hour approaching") is interpretable by design.
- Coupling of generation and selection. The current pipeline generates candidates, then scores them; the scoring is decoupled from the LLM reasoning. Giving the LLM the full pre-computed context directly is a simpler and more capable design.
Decision
Replace the RL bandit with a multi-agent pipeline:
Sub-agents (async, pre-computed)
Multiple domain-specialized Python agents each analyze user state from one angle and
produce a prompt snippet — a short natural-language paragraph describing what they
found. They do not produce tips. They run periodically (every 15 minutes) and store
results in the new agent_outputs table with per-agent TTLs.
Initial agent set:
| Agent | ID | TTL |
|---|---|---|
| OverdueTaskAgent | overdue-task | 1h |
| MomentumAgent | momentum | 6h |
| TimeOfDayAgent | time-of-day | 15m |
| RecentPatternsAgent | recent-patterns | 24h |
| FocusAreaAgent | focus-area | 12h |
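As a rough illustration of the snippet-plus-TTL idea, the sketch below builds one agent_outputs record in Python. The TTL values come from the table above; the `AgentOutput` dataclass, the `make_output` helper, the snippet wording, and the `"1"` version string are hypothetical and only mirror the column names defined later in this ADR.

```python
import json
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# TTLs copied from the initial agent set table.
AGENT_TTLS = {
    "overdue-task": timedelta(hours=1),
    "momentum": timedelta(hours=6),
    "time-of-day": timedelta(minutes=15),
    "recent-patterns": timedelta(hours=24),
    "focus-area": timedelta(hours=12),
}


@dataclass
class AgentOutput:
    """Hypothetical in-memory mirror of one agent_outputs row."""
    id: str
    user_id: str
    agent_id: str
    prompt_text: str
    signals_snapshot: str
    computed_at: str
    expires_at: str
    agent_version: str


def overdue_task_snippet(overdue_count: int) -> str:
    """Describe what the agent found; it does not write a tip (wording is made up)."""
    if overdue_count == 0:
        return "The user has no overdue tasks."
    return f"The user has {overdue_count} overdue task(s)."


def make_output(user_id: str, agent_id: str, prompt_text: str,
                signals: dict, now: datetime, agent_version: str = "1") -> AgentOutput:
    """Stamp a snippet with computed_at and expires_at = computed_at + TTL."""
    ttl = AGENT_TTLS[agent_id]
    return AgentOutput(
        id=str(uuid.uuid4()),
        user_id=user_id,
        agent_id=agent_id,
        prompt_text=prompt_text,
        signals_snapshot=json.dumps(signals, sort_keys=True),
        computed_at=now.isoformat(),
        expires_at=(now + ttl).isoformat(),
        agent_version=agent_version,
    )


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
out = make_output("u1", "overdue-task", overdue_task_snippet(3),
                  {"overdue_count": 3}, now)
```

Keeping the TTL lookup in one mapping means a per-agent freshness change touches a single line, which fits the intent that agents are self-contained.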
Orchestrator agent (real-time)
When a user requests a tip, the TypeScript recommender:
- Fetches all non-expired agent_outputs rows for the user.
- Calls POST /recommend on ml/serving with the snippet list.
- ml/serving assembles a single orchestrator prompt (template v4-orchestrator) that concatenates all snippets, then calls LiteLLM via the existing tip-generator alias to produce one tip.
No bandit scoring. No reward delivery to an ML model. The LLM receives full context and generates the tip in one call.
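The prompt-assembly step could look roughly like the sketch below. The actual v4-orchestrator template is not reproduced in this ADR, so `ORCHESTRATOR_TEMPLATE` and `build_orchestrator_prompt` are hypothetical stand-ins that only show the concatenation; the LiteLLM call itself is left out.

```python
from typing import Sequence, Tuple

# Hypothetical stand-in for the v4-orchestrator template (real text not shown here).
ORCHESTRATOR_TEMPLATE = (
    "Using the observations below, write exactly one short tip for the user.\n\n"
    "Observations:\n{observations}\n"
)


def build_orchestrator_prompt(snippets: Sequence[Tuple[str, str]]) -> str:
    """Concatenate (agent_id, prompt_text) pairs into a single prompt string."""
    lines = [f"- [{agent_id}] {text.strip()}" for agent_id, text in snippets]
    return ORCHESTRATOR_TEMPLATE.format(observations="\n".join(lines))


prompt = build_orchestrator_prompt([
    ("overdue-task", "The user has 3 overdue task(s)."),
    ("time-of-day", "The user's peak hour is approaching. "),
])
```

Tagging each line with its agent ID is one way to keep the link between the tip and the agents that contributed to it visible in logs.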
Feedback
tipFeedback rows are still written on every user reaction. inferReward() still runs
and rewardMilli is logged for observability and potential future supervised learning.
Reactions are not delivered to an ML endpoint.
New data model
CREATE TABLE agent_outputs (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id),
agent_id TEXT NOT NULL, -- e.g. 'overdue-task'
prompt_text TEXT NOT NULL, -- snippet produced by the agent
signals_snapshot TEXT, -- JSON: inputs the agent consumed
computed_at TEXT NOT NULL, -- ISO 8601
expires_at TEXT NOT NULL, -- ISO 8601 = computed_at + TTL
agent_version TEXT NOT NULL -- bump to invalidate cached outputs on logic changes
);
CREATE INDEX idx_agent_outputs_user_agent_exp
ON agent_outputs(user_id, agent_id, expires_at DESC);
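To make the read path concrete, here is a minimal sketch of the "fetch all non-expired rows" lookup using Python's built-in sqlite3 module. It assumes SQLite (suggested by the TEXT columns, but not stated here) and assumes every timestamp is written in the same ISO 8601 format and UTC offset, so that plain string comparison orders them correctly. The `fresh_snippets` helper and the sample rows are hypothetical.

```python
import sqlite3

# Schema from this ADR, plus a minimal users table so the foreign key resolves.
SCHEMA = """
CREATE TABLE users (id TEXT PRIMARY KEY);
CREATE TABLE agent_outputs (
  id TEXT PRIMARY KEY,
  user_id TEXT NOT NULL REFERENCES users(id),
  agent_id TEXT NOT NULL,
  prompt_text TEXT NOT NULL,
  signals_snapshot TEXT,
  computed_at TEXT NOT NULL,
  expires_at TEXT NOT NULL,
  agent_version TEXT NOT NULL
);
CREATE INDEX idx_agent_outputs_user_agent_exp
  ON agent_outputs(user_id, agent_id, expires_at DESC);
"""


def fresh_snippets(conn: sqlite3.Connection, user_id: str, now_iso: str):
    """Return (agent_id, prompt_text) for every row that has not yet expired."""
    cursor = conn.execute(
        "SELECT agent_id, prompt_text FROM agent_outputs "
        "WHERE user_id = ? AND expires_at > ? ORDER BY agent_id",
        (user_id, now_iso),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (id) VALUES ('u1')")
conn.executemany(
    "INSERT INTO agent_outputs VALUES (?, ?, ?, ?, NULL, ?, ?, '1')",
    [
        ("a", "u1", "overdue-task", "3 overdue tasks.",
         "2026-05-01T12:00:00+00:00", "2026-05-01T13:00:00+00:00"),
        ("b", "u1", "time-of-day", "Peak hour approaching.",
         "2026-05-01T12:00:00+00:00", "2026-05-01T12:15:00+00:00"),
    ],
)
# At 12:30 the 15-minute time-of-day row has expired; the 1-hour row has not.
result = fresh_snippets(conn, "u1", "2026-05-01T12:30:00+00:00")
```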
Consequences
Positive
- Tips are explainable: featuresJson in tipScores records which agents contributed.
- Cold-start is eliminated: the orchestrator reasons from signals immediately, no warm-up.
- Adding or removing an agent is a self-contained change in ml/agents/.
- Swapping LLM models remains a config change (LiteLLM alias unchanged).
Negative / risks
- No automatic exploration. The bandit would discover that a user prefers certain tip types without being told. The orchestrator only knows what the agents tell it. Mitigation: agents can evolve to encode richer signals; offline evaluation via the existing bench scripts remains available.
- Scheduler dependency. If the pre-compute job falls behind, agent outputs go stale. Mitigation: the orchestrator falls back to a raw-signal prompt when no outputs exist; TimeOfDayAgent recomputes every 15 min to stay fresh.
- Higher per-request token cost. The orchestrator prompt is longer than the old bandit prompt. Mitigation: the tip-generator alias points to a small local model; token cost is negligible at current scale.
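The stale-output mitigation might be sketched as follows. This ADR does not specify the raw-signal prompt format, so the `orchestrator_context` helper, the JSON rendering, and the signal names are assumptions made purely for illustration.

```python
import json
from typing import Mapping, Sequence


def orchestrator_context(snippets: Sequence[str],
                         raw_signals: Mapping[str, object]) -> str:
    """Prefer pre-computed snippets; fall back to raw signals when none exist."""
    if snippets:
        return "\n".join(snippets)
    # Fallback path: no non-expired agent outputs, so pass the raw signals through.
    # The JSON rendering is a placeholder format, not the project's actual prompt.
    return ("Raw user signals (no agent outputs available): "
            + json.dumps(dict(raw_signals), sort_keys=True))


with_outputs = orchestrator_context(["3 overdue tasks."], {"overdue_tasks": 3})
without_outputs = orchestrator_context([], {"overdue_tasks": 3})
```

A fallback like this keeps tip requests served when the scheduler lags, at the cost of the orchestrator seeing less curated context.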
Migration sequence
See plan document in conversation context. 10 steps; each independently deployable and rollback-able. Cutover is Step 6 (single TypeScript PR). Bandit endpoints removed in Step 7 after 48h clean traffic.