# ADR-0013 - Multi-agent recommendation: pre-computed agent snippets + orchestrator LLM

**Status:** Accepted

**Date:** 2026-05-01

**Supersedes:** ADR-0007, ADR-0012

## Context

The ε-greedy bandit (ADR-0007, promoted to v2 in ADR-0012) was the first recommendation policy. It served adequately during early M1 testing but carries structural problems that become more acute as the user base grows:

- **Training signal sparsity.** The median user generates fewer than 5 reward signals per week. Ridge regression on a 12-dimensional feature vector needs far more signal than that to converge to a meaningful θ, so the user may lose interest before the policy becomes useful.
- **Cold-start cost.** Every new user starts from an uninformative identity-matrix prior, so early tips are essentially random for the first weeks of use, precisely when first impressions matter most.
- **Opacity.** The bandit cannot explain why it chose a tip. An orchestrator that reasons explicitly over named agent outputs ("3 overdue tasks + peak hour approaching") is interpretable by design.
- **Disconnect between generation and selection.** The current pipeline generates candidates and then scores them, so the scoring is decoupled from the LLM's reasoning. Giving the LLM the full pre-computed context directly is a simpler and more capable design.

## Decision

Replace the RL bandit with a **multi-agent pipeline**:

### Sub-agents (async, pre-computed)

Multiple domain-specialized Python agents each analyze user state from one angle and produce a **prompt snippet**: a short natural-language paragraph describing what they found. They do not produce tips. They run periodically (every 15 minutes) and store results in the new `agent_outputs` table with per-agent TTLs.

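
As a rough illustration of the sub-agent contract described above, the sketch below shows one agent turning raw signals into a snippet plus the timestamps a row in `agent_outputs` needs. It is an assumption-laden sketch, not the code in `ml/agents/`: the `AgentOutput` dataclass, the `run` method, the `overdue_count` signal name, and the version label `"1"` are all hypothetical. Only the agent ID, the 1h TTL, and the column names come from this ADR.

```python
import json
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AgentOutput:
    """One row destined for agent_outputs (field names mirror the columns)."""
    id: str
    user_id: str
    agent_id: str
    prompt_text: str
    signals_snapshot: str
    computed_at: str
    expires_at: str
    agent_version: str


class OverdueTaskAgent:
    """Hypothetical shape of a sub-agent: signals in, one snippet out."""
    agent_id = "overdue-task"
    ttl = timedelta(hours=1)
    version = "1"  # hypothetical label; bumped when the agent logic changes

    def run(self, user_id: str, signals: dict, now: datetime) -> AgentOutput:
        overdue = signals.get("overdue_count", 0)  # hypothetical signal name
        if overdue == 0:
            text = "The user has no overdue tasks."
        else:
            noun = "task" if overdue == 1 else "tasks"
            text = f"The user has {overdue} overdue {noun}."
        return AgentOutput(
            id=str(uuid.uuid4()),
            user_id=user_id,
            agent_id=self.agent_id,
            prompt_text=text,  # a description of state, never a tip
            signals_snapshot=json.dumps(signals, sort_keys=True),
            computed_at=now.isoformat(),
            expires_at=(now + self.ttl).isoformat(),  # computed_at + TTL
            agent_version=self.version,
        )


example = OverdueTaskAgent().run(
    "user-1",
    {"overdue_count": 3},
    datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc),
)
print(example.prompt_text)
print(example.expires_at)
```

Passing `now` in explicitly, rather than reading the clock inside the agent, keeps the TTL arithmetic easy to test; whether the real agents do this is not stated here.
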
Initial agent set:

| Agent | ID | TTL |
|---|---|---|
| OverdueTaskAgent | `overdue-task` | 1h |
| MomentumAgent | `momentum` | 6h |
| TimeOfDayAgent | `time-of-day` | 15m |
| RecentPatternsAgent | `recent-patterns` | 24h |
| FocusAreaAgent | `focus-area` | 12h |

### Orchestrator agent (real-time)

When a user requests a tip, the TypeScript recommender:

1. Fetches all non-expired `agent_outputs` rows for the user.
2. Calls `POST /recommend` on `ml/serving` with the snippet list.
3. `ml/serving` assembles a single orchestrator prompt (template `v4-orchestrator`) that concatenates all snippets, then calls LiteLLM via the existing `tip-generator` alias to produce one tip.

No bandit scoring. No reward delivery to an ML model. The LLM receives full context and generates the tip in one call.

### Feedback

`tipFeedback` rows are still written on every user reaction. `inferReward()` still runs and `rewardMilli` is logged for observability and potential future supervised learning. Reactions are not delivered to an ML endpoint.

## New data model

```sql
CREATE TABLE agent_outputs (
  id               TEXT PRIMARY KEY,
  user_id          TEXT NOT NULL REFERENCES users(id),
  agent_id         TEXT NOT NULL,  -- e.g. 'overdue-task'
  prompt_text      TEXT NOT NULL,  -- snippet produced by the agent
  signals_snapshot TEXT,           -- JSON: inputs the agent consumed
  computed_at      TEXT NOT NULL,  -- ISO 8601
  expires_at       TEXT NOT NULL,  -- ISO 8601 = computed_at + TTL
  agent_version    TEXT NOT NULL   -- bump to invalidate cached outputs on logic changes
);

CREATE INDEX idx_agent_outputs_user_agent_exp
  ON agent_outputs(user_id, agent_id, expires_at DESC);
```

## Consequences

### Positive

- Tips are explainable: `featuresJson` in `tipScores` records which agents contributed.
- Cold-start is eliminated: the orchestrator reasons from signals immediately, no warm-up.
- Adding or removing an agent is a self-contained change in `ml/agents/`.
- Swapping LLM models remains a config change (LiteLLM alias unchanged).

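
For reference, the read side of the data model (step 1 of the orchestrator flow, followed by the snippet concatenation in step 3) might look roughly like the sketch below. Several things here are assumptions rather than decisions recorded in this ADR: the sketch uses Python and an in-memory SQLite database for illustration even though the production fetch lives in the TypeScript recommender; it keeps only the newest live row per agent; it assumes all timestamps are UTC strings in one fixed ISO 8601 format so that string comparison orders them correctly; and `concatenate` is a stand-in for the `v4-orchestrator` template, whose real content is not shown here. The function names and sample rows are hypothetical.

```python
import sqlite3

# Trimmed copy of the agent_outputs schema; the users foreign key is omitted
# so the demo runs without a users table.
SCHEMA = """
CREATE TABLE agent_outputs (
    id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    agent_id TEXT NOT NULL,
    prompt_text TEXT NOT NULL,
    signals_snapshot TEXT,
    computed_at TEXT NOT NULL,
    expires_at TEXT NOT NULL,
    agent_version TEXT NOT NULL
);
"""

# Newest non-expired row per agent for one user (the dedupe is an assumption).
LIVE_SNIPPETS = """
SELECT o.agent_id, o.prompt_text
FROM agent_outputs AS o
WHERE o.user_id = :user_id
  AND o.expires_at > :now
  AND o.computed_at = (
      SELECT MAX(i.computed_at)
      FROM agent_outputs AS i
      WHERE i.user_id = o.user_id
        AND i.agent_id = o.agent_id
        AND i.expires_at > :now
  )
ORDER BY o.agent_id
"""


def fetch_live_snippets(conn, user_id, now):
    """Return (agent_id, prompt_text) pairs that have not expired at `now`."""
    rows = conn.execute(LIVE_SNIPPETS, {"user_id": user_id, "now": now})
    return [(agent_id, text) for agent_id, text in rows.fetchall()]


def concatenate(snippets):
    """Stand-in for the orchestrator template: one snippet per line."""
    return "\n".join(text for _, text in snippets)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO agent_outputs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("a1", "u1", "overdue-task", "3 tasks are overdue.", None,
         "2026-05-01T09:00:00Z", "2026-05-01T10:00:00Z", "1"),
        ("a2", "u1", "overdue-task", "2 tasks are overdue.", None,
         "2026-05-01T09:15:00Z", "2026-05-01T10:15:00Z", "1"),
        ("a3", "u1", "time-of-day", "Peak hour is approaching.", None,
         "2026-05-01T09:00:00Z", "2026-05-01T09:15:00Z", "1"),
        ("a4", "u1", "momentum", "Four-day completion streak.", None,
         "2026-05-01T08:00:00Z", "2026-05-01T14:00:00Z", "1"),
    ],
)

snippets = fetch_live_snippets(conn, "u1", "2026-05-01T09:20:00Z")
prompt_body = concatenate(snippets)
print(prompt_body)
```

In the sample data the `time-of-day` row has already expired at 09:20 and the older `overdue-task` row is shadowed by the newer one, so only two snippets reach the prompt.
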
### Negative / risks

- **No automatic exploration.** The bandit would discover that a user prefers certain tip types without being told. The orchestrator only knows what the agents tell it. Mitigation: agents can evolve to encode richer signals; offline evaluation via the existing bench scripts remains available.
- **Scheduler dependency.** If the pre-compute job falls behind, agent outputs go stale. Mitigation: the orchestrator falls back to a raw-signal prompt when no outputs exist; `TimeOfDayAgent` recomputes every 15 min to stay fresh.
- **Higher per-request token cost.** The orchestrator prompt is longer than the old bandit prompt. Mitigation: the `tip-generator` alias points to a small local model; token cost is negligible at current scale.

## Migration sequence

See plan document in conversation context. The plan has 10 steps, each independently deployable and rollback-able. Cutover is Step 6 (single TypeScript PR). Bandit endpoints are removed in Step 7 after 48h of clean traffic.