Files
oO/README.md
alvis 85367aeaa0 feat: MLOps external services, AI stack planning, admin MLOps hub
Infrastructure:
- Add `mlops` compose profile: MLflow (basic-auth, /mlflow path) + Airflow (LocalExecutor, /airflow path) + airflow-db
- infra/mlflow/basic_auth.ini for MLflow auth config
- Caddy routes /mlflow* and /airflow* inside existing o.alogins.net block (see agap_git)
- Dockerfile.admin: NEXT_PUBLIC_MLFLOW_URL / NEXT_PUBLIC_AIRFLOW_URL build args (default /mlflow, /airflow)

Admin panel:
- /admin/models: replace MLflow iframe with external link cards
- /admin/experiments: replace LinUCB stats with MLOps hub (links to MLflow experiments/models + Airflow DAGs/datasets)
- AdminShell: external nav links for MLflow ↗ and Airflow ↗ under MLOps section

Docs & planning:
- README: new AI stack section (Ollama/LiteLLM/OpenWebUI three-tier, tip generation pipeline, model aliases)
- README: Phase 2 expanded with AI infra issues (#86-#93) and granular pipeline breakdown
- README: Phase 4 expanded with LLM MLOps items (#94-#97)
- CLAUDE.md: AI stack section, updated current phase (M1 shipped / M2 in progress), compose profiles, updated What NOT to do
- docs/architecture/overview.md: AI stack section, updated decision flow diagram for Phase 2 LLM pipeline
- ADR-0006: updated to reflect external services (path-based, not embedded)
- Gitea issues #86-#97 created (M2: AI infra + pipeline; M4: LLM MLOps)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 08:20:44 +00:00

276 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# oO
> One tip. Right now. Feels like magic.
oO learns who you are from the apps you already use and surfaces **one** perfectly-timed suggestion — an advice or a todo — on a black page. No feed. No dashboard. One tip.
---
## Why
Everyone has too many tasks, too many apps, too much noise. What people actually need is a single, well-chosen nudge at the right moment. oO is that nudge, powered by a recommendation engine that gets smarter the more of your life it sees.
## Product principles
1. **One thing at a time.** The UI is a black page with one tip. That's the product.
2. **We don't own your data, we understand it.** Connect your apps; we read what we need, when we need it.
3. **Magic requires craft.** Precision, timing, and restraint matter more than features.
4. **Private by default.** Tokens are encrypted, models are per-user, deletion is one click.
## Prototype scope (Phase 0)
Three pages. That's it.
| Page | What it does |
|------|--------------|
| **Sign in** | Google / Apple OAuth. No passwords. |
| **Connect** | A list of integrations. Tap "Todoist" → OAuth flow → token stored. |
| **Tip** | Black page. One tip. Tap to dismiss / done / snooze. |
Under the hood the "pick a tip" call already routes through a `recommender` service with a pluggable policy — so v0 is literally "random Todoist task" but every other version slots into the same contract.
---
## Architecture at a glance
```
┌──────────┐ OAuth ┌────────────┐
│ Web / │──────────▶│ auth │
│ Mobile │ └─────┬──────┘
│ client │ │ JWT
│ │ REST/GraphQL ▼
│ │────────▶┌───────────────┐
└──────────┘ │ gateway │──┬──▶ profile
└───────┬───────┘ ├──▶ integrations ──▶ Todoist / Google / ...
│ └──▶ recommender ──▶ ml/serving (Python)
┌───────────────┐
│ events │ ◀── integrations emit normalized events
│ (Kafka/NATS) │ ──▶ ml/pipelines (features, training)
└───────────────┘
```
More detail in [`docs/architecture/`](docs/architecture/) and decisions in [`docs/adr/`](docs/adr/).
## Monorepo layout
See [`CLAUDE.md`](CLAUDE.md) for the full tree and conventions.
```
apps/ web, ios, android
services/ gateway, auth, profile, integrations, recommender, events, notifier
packages/ shared-types, sdk-js, ui
ml/ pipelines, features, registry, experiments, serving
infra/ docker, k8s, terraform, ci
docs/ architecture, adr, api
```
---
## AI stack
oO is AI-native: the recommender's job is to **rank**, not to write. An LLM generates candidate tips from the user's context; the bandit picks the best one.
### Three-tier layout
| Tier | Service | Purpose | Where |
|------|---------|---------|-------|
| Inference | **Ollama** | Local LLM + embedding; no data leaves the host | `localhost:11434` |
| Routing | **LiteLLM** | Unified OpenAI-compatible API; model aliases; cloud fallback | `llm.alogins.net` (Agap shared) |
| Testing | **OpenWebUI** | Prompt iteration, model comparison, manual evals | `ai.alogins.net` (Agap shared) |
### Tip generation pipeline (Phase 2 target)
```
User signals ──▶ Context assembler ──▶ LiteLLM ──▶ Ollama (local)
(tasks, calendar, (ml/features/) (routing) or cloud fallback
patterns, time)
N typed TipCandidates
{content, kind, model,
prompt_version, confidence}
Bandit policy (ml/serving)
scores + ranks candidates
Best tip shown
User reaction (done / snooze / dismiss + dwell)
Online bandit update + prompt_version tracking
```
**Why LiteLLM as gateway:** All LLM calls use a single `LITELLM_URL` env var. Swapping from qwen2.5 to llama3.2, or routing a fraction to Claude for A/B, is a config change in LiteLLM — zero code change in oO. The model name in `tip_scores` tells you exactly which model produced each tip.
**Why Ollama first:** Tips contain personal context. Local inference means no user data leaves the host for the inference path. Cloud models (Anthropic, OpenAI) are opt-in fallbacks for evaluation and simulation only, gated behind `ANTHROPIC_API_KEY`.
### Models (planned)
| Alias | Model | Task |
|-------|-------|------|
| `tip-generator` | qwen2.5:7b (default) | Generate typed tip candidates from user context |
| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup |
| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B |
---
## Roadmap
### Phase 0 — Walking skeleton *(M0)* ✓ shipped
Goal: a single user signs in with Google, connects Todoist, and sees one random Todoist task on a black page. Deletion works.
- [x] Monorepo scaffold, docker-compose dev env
- [x] `auth` — Google OAuth2/PKCE via openid-client v6; session cookie; Next.js middleware guard
- [x] `integrations/todoist` — OAuth2 flow, token stored in DB, disconnect supported
- [x] `recommender` with `RandomPolicy`; stable `POST /recommend` contract; 30s task cache
- [x] `apps/web` — sign-in, connect, tip pages; PWA manifest + icons
- [x] Feedback: `done / snooze / dismiss`; reward inferred from dwell-time (`inferReward`); marks task complete in Todoist
- [x] Deploy modular monolith to Agap VM via Caddy at `o.alogins.net`
- [x] ToS + Privacy Policy pages (`/legal/terms`, `/legal/privacy`); implicit consent on sign-in
- [x] Account deletion: revokes tokens, purges data, soft-deletes profile; button on /connect
- [x] Metrics baseline: `tip_views` table (tip served) + `tip_feedback` (reactions) — activation + reaction rate queryable
### Phase 1 — Real signal + in-the-moment delivery *(M1)* ✓ shipped
Goal: tips are picked, not drawn from a hat — and they arrive at the right moment on the web.
- [x] Event bus scaffold: typed in-process EventEmitter with 500-event ring buffer; subjects match future NATS JetStream — swap is mechanical
- [x] Todoist sync emits `signals.task.synced`; tip served/feedback emit `signals.tip.*`
- [x] Features extracted per task: `is_overdue`, `task_age_days`, `priority`; context: `hour_of_day`, `day_of_week`
- [x] `ml/serving` LinUCB (d=5) + **ε-greedy v1** (d=7, ε=0.10, day-of-week sin/cos features); per-user state persisted to disk
- [x] `RemotePolicy` in recommender: calls ml/serving, falls back to RandomPolicy on timeout/error; logs explainability to `tip_scores`
- [x] Feedback loop: dwell-time inferred reward (`inferReward`) → online model update; `done` in 15 s2 min = +1.0 (magic zone)
- [x] Offline simulation framework (`ml/experiments/sim`): rule/LLM/claude-code judges, two-policy comparison, results persisted to `sim_runs` + `sim_events`
- [x] **ε-greedy v1 promoted to active policy** (ADR-0007) — +10.7% mean reward vs LinUCB in offline sim
- [x] **Web Push** (VAPID): SW, subscribe/unsubscribe API, "notify me" button on tip page
- [x] Shadow-policy registry: run N shadow policies per request, log picks without serving them (#56)
- [ ] Quiet-hours + dedupe for push delivery
- [ ] Delayed rewards: tasks completed directly in Todoist (requires webhook from Todoist)
- [ ] NATS JetStream replacing in-process bus (when multi-process pressure arrives)
#### M1 add-on — Admin & ML Ops Console *(fully shipped)*
oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit).
**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.** Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Airflow) runs as **separate external services** linked from the admin shell; Grafana panels are embedded.
| Layer | Tool | Why |
|-------|------|-----|
| App shell | **Next.js 15** (new `apps/admin`) | Same stack as `apps/web`; reuses auth, types, SDK |
| Dashboards / charts | **[Tremor](https://tremor.so)** | Analytics-first React + Tailwind — KPI cards, time-series, categorical, heatmaps |
| CRUD primitives | **[shadcn/ui](https://ui.shadcn.com)** | Copy-paste Radix components; forms, dialogs, command palette |
| Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) |
| Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) |
| Model registry / experiments | **[MLflow](https://mlflow.org)** *(external — `o.alogins.net/mlflow`)* | Experiment tracking, artifact browser, model registry; own basic-auth |
| Pipeline orchestration | **[Airflow](https://airflow.apache.org)** *(external — `o.alogins.net/airflow`)* | Batch feature + retraining DAGs; own web-auth |
| Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth |
| Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link |
| AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface |
**Rejected alternatives (so we don't re-litigate):**
- *Retool / AppSmith* — low-code speed, but admin logic leaves our repo; weak analytics affordances for an analytics product
- *Streamlit / Gradio / Dash* — Python-first; thin RBAC and routing; splits our frontend stack in two
- *React-admin / Refine.dev* — strong CRUD scaffolding, but analytics/ML views feel bolted on; we'd rebuild Tremor-style dashboards ourselves
- *Superset / Metabase as the admin surface* — excellent for BI, poor for operational **writes** (revoke, replay, promote). Plan: **adopt Superset in M4** for BI alongside batch pipelines; ship a read-only SQL widget inside admin for now
**Build sequence (plan, not code):**
1. [x] **ADR-0006** — record the framework choice + "embed, don't rebuild" rule for MLflow/Grafana
2. [x] **Scaffold**`apps/admin` with Next.js 15, Tailwind, Tremor; deploy behind Caddy at `admin.o.alogins.net`
3. [x] **RBAC**`role` column on `users`; admin-only Next.js middleware; seed first admin via `ADMIN_SEED_EMAIL` env; `admin_actions` audit-log table
4. [x] **Overview dashboard** — DAU/WAU KPI cards, tips served, reaction breakdown, activation funnel
5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit actions
6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS
7. [x] **Feature store browser** — features sent to `ml/serving` per scoring call; diff across time for a user
8. [x] **Model registry panel**`/admin/models` links out to MLflow (`mlflow.o.alogins.net`); experiment tracking and dataset management in MLflow + Airflow
9. [x] **MLOps hub**`/admin/experiments` links to MLflow experiments/models and Airflow DAGs/datasets; bandit reset on Users page
10. [x] **Recommendation log (explainability)** — per served tip: `(user, features, policy, score, feedback, latency)`; `tip_scores` table, 30-day retention
11. [x] **Reward analytics** — reaction distribution over time; per-policy compare; slice by `hour_of_day`, `priority`, cohort
12. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap
13. [x] **Ops actions** — revoke token (Users page), replay signal, disable/promote shadow policy; every action audit-logged
14. [x] **Read-only SQL runner** — SELECT-only runner against SQLite + saved queries (sunsets to Superset in M4)
15. [x] **Health rollup**`/admin/health` surfaces api, ml/serving, SQLite, event-bus; auto-refreshes every 15s
16. [ ] **Docs**`apps/admin/README.md`, runbook for common ops actions, ADR-0006 merged
- [ ] Apple OAuth (deferred to M2)
### Phase 2 — AI tips + multi-source signals *(M2)*
Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multiple signal sources feed a generalized pipeline. Research-intensive milestone.
**AI infrastructure (unblock everything else):**
- [ ] `ai` compose profile — Ollama + LiteLLM for local dev; env vars `OLLAMA_URL` / `LITELLM_URL` (#86)
- [ ] AI gateway — wire `ml/serving` to LiteLLM; model aliases `tip-generator` + `embedder` (#87)
**AI tip generation pipeline:**
- [ ] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`) (#88)
- [ ] Tip generator endpoint — `POST /generate` in `ml/serving`; LLM → N typed `TipCandidate` objects (#79)
- [ ] `TipCandidate` shared schema — `{content, kind, source, model, prompt_version, confidence}`; update recommender pipeline (#89)
- [ ] LLM output validation + retry — JSON schema gate, clarification retry (2×), fallback to task-based (#90)
- [ ] Prompt versioning — `prompt_version` + `model` columns in `tip_scores`; content-hash invalidation (#91)
- [ ] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)
**Evaluation & model selection:**
- [ ] Model benchmark — compare qwen2.5:7b / llama3.2:3b / gemma3:4b via offline sim + LLM judge (#93)
- [ ] LLM prompt research — persona design, context injection strategies, few-shot examples (#84)
**Pipeline architecture:**
- [ ] Signal source abstraction — `SignalSource` interface generalizing beyond Todoist (#78)
- [ ] Generalized recommendation pipeline — candidate → rank → render stages (#80)
- [ ] Feature registry + user profile builder — centralized features, persistent profiles (#81)
- [ ] Tip kind system — task, advice, insight, reminder with kind-aware UI + rewards (#82)
**Policy research:**
- [ ] Next-gen policies — Thompson sampling, neural bandits, hybrid transfer learning (#83)
**Integrations & infra (carried from M1):**
- [ ] Apple OAuth (#7)
- [ ] NATS JetStream replacing in-process bus (#21)
- [ ] Todoist sync via events (#22)
- [ ] Event schema registry + protobuf CI gate (#54)
- [ ] Per-user freshness SLAs for features (#61)
- [ ] CI skeleton (#3), observability (#18), E2E tests (#20)
**Bugs (fix before new features):**
- [ ] TipFeedback type mismatch (#73)
- [ ] Todoist token refresh (#74)
- [ ] Reward fire-and-forget (#75)
- [ ] Data retention purge (#76)
- [ ] Port mismatch (#77)
### Phase 3 — Native mobile *(M3)*
- [ ] iOS app (SwiftUI) with APNs push
- [ ] Android app (Compose) with FCM push
- [ ] `notifier` gains APNs + FCM channels, per-device rate limits
- [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004)
- [ ] Consolidate MLflow + Airflow behind shared OIDC (SSO for all internal services)
- [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold
### Phase 4 — MLOps at scale *(M4)*
- [x] Airflow + MLflow deployed as external services (`mlops` compose profile); each with own auth
- [ ] Write first retraining DAG (Airflow) + first MLflow experiment logging from `ml/serving`
- [ ] Feature-to-prompt pipeline — nightly Airflow DAG materializes context for LLM; cuts inline latency (#94)
- [ ] Prompt optimization loop — sim A/B → MLflow experiment → human-approved promotion (#95)
- [ ] LLM fine-tuning — tip reactions as training signal; LoRA on base model; MLflow tracks runs (#96)
- [ ] Embedding-based task clustering — `nomic-embed-text` for dedup + user pattern features (#97)
- [ ] Consolidate MLflow + Airflow auth into shared OIDC provider (tracked as M3 issue #85)
- [ ] Shadow → A/B → launch pipeline as first-class in MLflow
- [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B
- [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks
- [ ] Drift monitoring (feature + prediction + reward drift); model cards per LLM version
### Phase 5 — Production hardening *(M5)*
- [ ] Audit logging, rotation of provider tokens + internal signing keys
- [ ] **k3s** on existing VM, then k8s + HPA once multi-node justified (no cliff)
- [ ] Multi-region failover, Postgres PITR, event-bus mirroring
- [ ] Public integration SDK; sandbox tenancy for third-party connectors
- [ ] Billing + subscription tiers
---
## Contributing
This repo is split into independent modules; most tickets belong to exactly one. Pick an issue, check its milestone (= phase), read the service's `README.md`, ship.
Conventions and per-service guidance live in [`CLAUDE.md`](CLAUDE.md).
## License
All rights reserved — 2026. Contact the owner for licensing inquiries.
(We'll switch to an OSS license for non-sensitive packages once the public SDK lands in Phase 5.)