Files
oO/README.md
alvis d454a0a8bf docs: ADR-0014 — unified Profile model + agent registry
Propose a shared substrate for per-user prefs, contexts, per-key
consents, and per-agent state so adding an agent stays a manifest
change. Updates CLAUDE.md, README, and architecture docs to reflect
the multi-agent pipeline (ADR-0013) and the registry direction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 10:19:07 +00:00

297 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# oO
> One tip. Right now. Feels like magic.
oO learns who you are from the apps you already use and surfaces **one** perfectly-timed suggestion — an advice or a todo — on a black page. No feed. No dashboard. One tip.
---
## Why
Everyone has too many tasks, too many apps, too much noise. What people actually need is a single, well-chosen nudge at the right moment. oO is that nudge, powered by a recommendation engine that gets smarter the more of your life it sees.
## Product principles
1. **One thing at a time.** The UI is a black page with one tip. That's the product.
2. **We don't own your data, we understand it.** Connect your apps; we read what we need, when we need it.
3. **Magic requires craft.** Precision, timing, and restraint matter more than features.
4. **Private by default.** Tokens are encrypted, models are per-user, deletion is one click.
## Prototype scope (Phase 0)
Three pages. That's it.
| Page | What it does |
|------|--------------|
| **Sign in** | Google / Apple OAuth. No passwords. |
| **Connect** | A list of integrations. Tap "Todoist" → OAuth flow → token stored. |
| **Tip** | Black page. One tip. Tap to dismiss / done / snooze. |
Under the hood the "pick a tip" call already routes through a `recommender` service with a pluggable policy — so v0 is literally "random Todoist task" but every other version slots into the same contract.
---
## Architecture at a glance
```
┌──────────┐ OAuth ┌────────────┐
│ Web / │──────────▶│ auth │
│ Mobile │ └─────┬──────┘
│ client │ │ JWT
│ │ REST/GraphQL ▼
│ │────────▶┌───────────────┐
└──────────┘ │ gateway │──┬──▶ profile
└───────┬───────┘ ├──▶ integrations ──▶ Todoist / Google / ...
│ └──▶ recommender ──▶ ml/serving (Python)
┌───────────────┐
│ events │ ◀── integrations emit normalized events
│ (Kafka/NATS) │ ──▶ ml/pipelines (features, training)
└───────────────┘
```
More detail in [`docs/architecture/`](docs/architecture/) and decisions in [`docs/adr/`](docs/adr/).
## Monorepo layout
See [`CLAUDE.md`](CLAUDE.md) for the full tree and conventions.
```
apps/ web, ios, android
services/ gateway, auth, profile, integrations, recommender, events, notifier
packages/ shared-types, sdk-js, ui
ml/ pipelines, features, registry, experiments, serving
infra/ docker, k8s, terraform, ci
docs/ architecture, adr, api
```
---
## AI stack
oO is AI-native. Domain-specialized agents pre-compute snippets describing the user's state from one angle each; an orchestrator LLM reasons over the assembled snippets and produces one tip (ADR-0013). The orchestrator iterates a registry, not a hardcoded list (ADR-0014) — adding an agent is a manifest change, nothing else.
### Three-tier layout
| Tier | Service | Purpose | Where |
|------|---------|---------|-------|
| Inference | **Ollama** | Local LLM + embedding; no data leaves the host | `localhost:11434` |
| Routing | **LiteLLM** | Unified OpenAI-compatible API; model aliases; cloud fallback | `llm.alogins.net` (Agap shared) |
| Testing | **OpenWebUI** | Prompt iteration, model comparison, manual evals | `ai.alogins.net` (Agap shared) |
### Tip generation pipeline (ADR-0013, M2)
```
User signals Pre-compute agents (every 15 min)
(tasks, calendar, ──▶ ml/agents/{overdue-task, momentum, ──▶ agent_outputs
patterns, time) time-of-day, recent-patterns, (per-agent TTL)
focus-area, ...}
Eligibility filter: required consents + │
active context + per-user prefs (ADR-0014) ◀──┘
Orchestrator prompt (`v4-orchestrator`)
= global prefs + active context + snippets
LiteLLM ──▶ Ollama (local) / cloud fallback
Tip shown to user
User reaction (done / snooze / dismiss + dwell)
Logged to tip_feedback for observability
(no online ML reward loop — see ADR-0013)
```
**Why LiteLLM as gateway:** All LLM calls use a single `LITELLM_URL` env var. Swapping from qwen2.5 to llama3.2, or routing a fraction to Claude for A/B, is a config change in LiteLLM — zero code change in oO. The model name in `tip_scores` tells you exactly which model produced each tip.
**Why Ollama first:** Tips contain personal context. Local inference means no user data leaves the host for the inference path. Cloud models (Anthropic, OpenAI) are opt-in fallbacks for evaluation and simulation only, gated behind `ANTHROPIC_API_KEY`.
### Models (planned; routes through LiteLLM)
| Alias | Model | Task |
|-------|-------|------|
| `tip-generator` | qwen2.5:1.5b (default) | Generate typed tip candidates from user context; local-first via Ollama |
| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup; local via Ollama |
| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B (requires `ANTHROPIC_API_KEY`) |
All model calls route through **LiteLLM** at `llm.alogins.net` (or `LITELLM_URL` env var) using model aliases. This decouples tip generation from model selection — swap the backend model in LiteLLM config without code changes. See ADR-0008.
---
## Roadmap
### Phase 0 — Walking skeleton *(M0)* ✓ shipped
Goal: a single user signs in with Google, connects Todoist, and sees one random Todoist task on a black page. Deletion works.
- [x] Monorepo scaffold, docker-compose dev env
- [x] `auth` — Google OAuth2/PKCE via openid-client v6; session cookie; Next.js middleware guard
- [x] `integrations/todoist` — OAuth2 flow, token stored in DB, disconnect supported
- [x] `recommender` with `RandomPolicy`; stable `POST /recommend` contract; 30s task cache
- [x] `apps/web` — sign-in, connect, tip pages; PWA manifest + icons
- [x] Feedback: `done / snooze / dismiss`; reward inferred from dwell-time (`inferReward`); marks task complete in Todoist
- [x] Deploy modular monolith to Agap VM via Caddy at `o.alogins.net`
- [x] ToS + Privacy Policy pages (`/legal/terms`, `/legal/privacy`); implicit consent on sign-in
- [x] Account deletion: revokes tokens, purges data, soft-deletes profile; button on /connect
- [x] Metrics baseline: `tip_views` table (tip served) + `tip_feedback` (reactions) — activation + reaction rate queryable
### Phase 1 — Real signal + in-the-moment delivery *(M1)* ✓ shipped
Goal: tips are picked, not drawn from a hat — and they arrive at the right moment on the web.
- [x] Event bus scaffold: typed in-process EventEmitter with 500-event ring buffer; subjects match future NATS JetStream — swap is mechanical
- [x] Todoist sync emits `signals.task.synced`; tip served/feedback emit `signals.tip.*`
- [x] Features extracted per task: `is_overdue`, `task_age_days`, `priority`; context: `hour_of_day`, `day_of_week`
- [x] **ε-greedy v1** (d=7, ε=0.10, day-of-week sin/cos features); per-user state persisted to disk
- [x] **ε-greedy v2** (d=12, profile features: completion rate, dismiss rate, dwell, preferred hour, tip volume) in shadow; promoted to active policy (ADR-0012)
- [x] `RemotePolicy` in recommender: calls ml/serving, falls back to RandomPolicy on timeout/error; logs explainability to `tip_scores`
- [x] Feedback loop: dwell-time inferred reward (`inferReward`) → online model update; `done` in 15 s2 min = +1.0 (magic zone)
- [x] Offline simulation framework (`ml/experiments/sim`): rule/LLM/claude-code judges, two-policy comparison, results persisted to `sim_runs` + `sim_events`
- [x] **Web Push** (VAPID): SW, subscribe/unsubscribe API, "notify me" button on tip page
- [x] Shadow-policy registry: run N shadow policies per request, log picks without serving them (#56)
- [x] NATS JetStream bridge — durable `signals.>` and `feedback.>` streams; in-process bus stays the source of truth, every publish bridges out (#21, shipped)
- [x] Per-user profile features (completion rate, dismiss rate, dwell, preferred hour, tip volume) — event-driven, JIT invalidation (#81)
- [ ] Quiet-hours + dedupe for push delivery
- [ ] Delayed rewards: tasks completed directly in Todoist (requires webhook from Todoist)
- [ ] Apple OAuth (deferred to M3)
#### M1 add-on — Admin & ML Ops Console *(fully shipped)*
oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit).
**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.** Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow) runs as a **separate external service** linked from the admin shell; Grafana panels are embedded.
| Layer | Tool | Why |
|-------|------|-----|
| App shell | **Next.js 15** (new `apps/admin`) | Same stack as `apps/web`; reuses auth, types, SDK |
| Dashboards / charts | **[Tremor](https://tremor.so)** | Analytics-first React + Tailwind — KPI cards, time-series, categorical, heatmaps |
| CRUD primitives | **[shadcn/ui](https://ui.shadcn.com)** | Copy-paste Radix components; forms, dialogs, command palette |
| Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) |
| Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) |
| Model registry / experiments | **[MLflow](https://mlflow.org)** *(external — `o.alogins.net/mlflow`)* | Experiment tracking, artifact browser, model registry; own basic-auth |
| Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth |
| Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link |
| AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface |
**Rejected alternatives (so we don't re-litigate):**
- *Retool / AppSmith* — low-code speed, but admin logic leaves our repo; weak analytics affordances for an analytics product
- *Streamlit / Gradio / Dash* — Python-first; thin RBAC and routing; splits our frontend stack in two
- *React-admin / Refine.dev* — strong CRUD scaffolding, but analytics/ML views feel bolted on; we'd rebuild Tremor-style dashboards ourselves
- *Superset / Metabase as the admin surface* — excellent for BI, poor for operational **writes** (revoke, replay, promote). Plan: **adopt Superset in M4** for BI alongside batch pipelines; ship a read-only SQL widget inside admin for now
**Build sequence:**
1. [x] **ADR-0006** — record the framework choice + "embed, don't rebuild" rule for MLflow/Grafana
2. [x] **Scaffold**`apps/admin` with Next.js 15, Tailwind, Tremor; deploy behind Caddy at `admin.o.alogins.net`
3. [x] **RBAC**`role` column on `users`; admin-only Next.js middleware; seed first admin via `ADMIN_SEED_EMAIL` env; `admin_actions` audit-log table
4. [x] **Overview dashboard** — DAU/WAU KPI cards, tips served, reaction breakdown, activation funnel
5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit + rebuild-profile actions
6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS
7. [x] **Features page** — features sent to `ml/serving` per scoring call; per-user profile features with freshness; diff across time
8. [x] **Tips page** — tips served, scored, feedback reactions with policy/model breakdown
9. [x] **Reward analytics** — reaction distribution over time; per-policy / per-model / per-prompt-version compare; slice by `hour_of_day`, `priority`, cohort
10. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap; per-feature freshness SLA status
11. [x] **Ops actions** — revoke token (Users page), rebuild profile, reset bandit, enable/disable shadow policies; every action audit-logged
12. [x] **Health rollup**`/admin/health` surfaces api, ml/serving, SQLite, event-bus, MLflow; auto-refreshes every 15s
13. [x] **Read-only SQL runner** — SELECT-only runner against SQLite + saved queries (sunsets to Superset in M4)
14. [x] **Offline simulation runner** — launch `ml/experiments/sim` from admin UI; track sim runs, judge, policy comparison
15. [x] **Token-based admin auth**`POST /api/auth/token` for Playwright/CI; `ADMIN_TOKEN` env var (#105)
16. [x] **Docs pages** — admin documentation and runbooks inline
### Phase 2 — AI tips + multi-source signals *(M2)* in progress
Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multiple signal sources feed a generalized pipeline. Research-intensive milestone.
**Architectural shift (mid-M2):** the bandit-ranks-LLM-candidates design from earlier in M2 was replaced with a multi-agent pipeline (ADR-0013): pre-compute agents emit prompt snippets, an orchestrator LLM produces the tip directly. ADR-0014 layers a unified Profile + agent registry + auto-inference framework on top so the system generalizes cleanly to N agents.
**Multi-agent recommendation (ADR-0013, shipped):**
- [x] `agent_outputs` table + per-agent TTL caching
- [x] Five initial agents: `overdue-task`, `momentum`, `time-of-day`, `recent-patterns`, `focus-area`
- [x] Agent pre-compute scheduler
- [x] Orchestrator cutover — recommender calls `ml/serving` with snippet list, no bandit scoring
- [x] Bandit endpoints + shadow policy machinery removed
**Unified Profile + agent registry (ADR-0014, in progress):**
- [ ] Unified Profile model: prefs, contexts, consents + manifest plumbing + orchestrator cutover (#30)
- [ ] Shared context-inference framework (#111)
- [ ] Per-agent auto-inference: `time-of-day` (#112), `focus-area` (#113), `momentum` (#114), `overdue-task` (#115), `recent-patterns` (#116)
**AI infrastructure (unblock everything else):**
- [ ] `ai` compose profile — Ollama + LiteLLM for local dev; env vars `OLLAMA_URL` / `LITELLM_URL` (#86)
- [ ] AI gateway — wire `ml/serving` to LiteLLM; model aliases `tip-generator` + `embedder` (#87)
**AI tip generation pipeline:**
- [x] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`); skeleton implemented
- [ ] Tip generator endpoint — `POST /generate` in `ml/serving`; LLM → N typed `TipCandidate` objects (#79)
- [ ] `TipCandidate` shared schema — `{content, kind, source, model, prompt_version, confidence}`; update recommender pipeline (#89)
- [ ] LLM output validation + retry — JSON schema gate, clarification retry (2×), fallback to task-based (#90)
- [ ] Prompt versioning — `prompt_version` + `model` columns in `tip_scores`; content-hash invalidation (#91)
- [x] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)
**Evaluation & model selection:**
- [ ] Model benchmark — compare qwen2.5:7b / llama3.2:3b / gemma3:4b via offline sim + LLM judge (#93)
- [ ] LLM prompt research — persona design, context injection strategies, few-shot examples (#84)
**Pipeline architecture:**
- [x] Signal source abstraction — `SignalSource` interface for Todoist + extensible design (#78)
- [ ] Generalized recommendation pipeline — candidate → rank → render stages (#80)
- [x] Feature registry + user profile builder — centralized features, persistent profiles, event-driven invalidation (#81)
- [ ] Tip kind system — task, advice, insight, reminder with kind-aware UI + rewards (#82)
**Policy research:**
- [ ] Next-gen policies — Thompson sampling, neural bandits, hybrid transfer learning (#83)
**Integrations & infra (carried from M1):**
- [ ] Apple OAuth (#7)
- [x] NATS JetStream replacing in-process bus (#21) — adapter ships in `services/api/src/events/nats.ts`; in-proc bus is the producer, JetStream is the durable mirror
- [x] Todoist sync via events (#22) — background scheduler in `services/api/src/signals/scheduler.ts` emits `signals.task.synced` every `TODOIST_SYNC_INTERVAL_MS`; on-demand fetch remains as freshness fallback
- [x] Event schema registry + protobuf CI gate (#54) — buf lint/breaking checks on every PR
- [x] Per-user freshness SLAs for features (#61) — context-feature (JIT) vs profile-feature (batched) spec in ADR-0011; CONTEXT_FEATURES in ml/features/context.py
- [x] Observability (#18) — structured logs via pino, W3C trace IDs, Sentry hooks, trace correlation end-to-end
- [ ] CI skeleton (#3), E2E tests (#20)
**Bugs & UX (fix before new features):**
- [x] TipFeedback type mismatch (#73)
- [x] Todoist token refresh (#74) — OAuth token auto-refresh on 401
- [x] Reward fire-and-forget (#75) — retry logic + logging
- [x] Data retention purge (#76) — daily purge of 30-day-old tip_scores/tip_feedback
- [x] Port mismatch (#77) — fixed in docker-compose + env var config
- [ ] UX refinements (#100102) — "done/snooze/dismiss" feedback only, config page UI, settings gear button
### Phase 3 — Native mobile *(M3)*
- [ ] iOS app (SwiftUI) with APNs push
- [ ] Android app (Compose) with FCM push
- [ ] `notifier` gains APNs + FCM channels, per-device rate limits
- [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004)
- [ ] Consolidate MLflow behind shared OIDC (SSO for all internal services)
- [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold
### Phase 4 — MLOps at scale *(M4)*
- [x] MLflow deployed as external service (`mlops` compose profile); own auth; health check integrated
- [ ] Write first retraining pipeline + first MLflow experiment logging from `ml/serving` + JetStream consumers (#98)
- [ ] Feature-to-prompt pipeline — nightly batch job materializes context for LLM; cuts inline latency (#94)
- [ ] Prompt optimization loop — sim A/B → MLflow experiment → human-approved promotion (#95)
- [ ] LLM fine-tuning — tip reactions as training signal; LoRA on base model; MLflow tracks runs (#96)
- [ ] Embedding-based task clustering — `nomic-embed-text` for dedup + user pattern features (#97)
- [ ] Modular-monolith packaging + import-boundary lint (#47)
- [ ] Consolidate MLflow auth into shared OIDC provider (tracked as M3 issue #85)
- [ ] Shadow → A/B → launch pipeline as first-class in MLflow
- [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B
- [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks
- [ ] Drift monitoring (feature + prediction + reward drift); model cards per LLM version
### Phase 5 — Production hardening *(M5)*
- [ ] Audit logging, rotation of provider tokens + internal signing keys
- [ ] **k3s** on existing VM, then k8s + HPA once multi-node justified (no cliff)
- [ ] Multi-region failover, Postgres PITR, event-bus mirroring
- [ ] Public integration SDK; sandbox tenancy for third-party connectors
- [ ] Billing + subscription tiers
---
## Contributing
This repo is split into independent modules; most tickets belong to exactly one. Pick an issue, check its milestone (= phase), read the service's `README.md`, ship.
Conventions and per-service guidance live in [`CLAUDE.md`](CLAUDE.md).
## License
All rights reserved — 2026. Contact the owner for licensing inquiries.
(We'll switch to an OSS license for non-sensitive packages once the public SDK lands in Phase 5.)