feat: MLOps external services, AI stack planning, admin MLOps hub

Infrastructure:
- Add `mlops` compose profile: MLflow (basic-auth, /mlflow path) + Airflow (LocalExecutor, /airflow path) + airflow-db
- infra/mlflow/basic_auth.ini for MLflow auth config
- Caddy routes /mlflow* and /airflow* inside existing o.alogins.net block (see agap_git)
- Dockerfile.admin: NEXT_PUBLIC_MLFLOW_URL / NEXT_PUBLIC_AIRFLOW_URL build args (default /mlflow, /airflow)

Admin panel:
- /admin/models: replace MLflow iframe with external link cards
- /admin/experiments: replace LinUCB stats with MLOps hub (links to MLflow experiments/models + Airflow DAGs/datasets)
- AdminShell: external nav links for MLflow ↗ and Airflow ↗ under MLOps section

Docs & planning:
- README: new AI stack section (Ollama/LiteLLM/OpenWebUI three-tier, tip generation pipeline, model aliases)
- README: Phase 2 expanded with AI infra issues (#86-#93) and granular pipeline breakdown
- README: Phase 4 expanded with LLM MLOps items (#94-#97)
- CLAUDE.md: AI stack section, updated current phase (M1 shipped / M2 in progress), compose profiles, updated What NOT to do
- docs/architecture/overview.md: AI stack section, updated decision flow diagram for Phase 2 LLM pipeline
- ADR-0006: updated to reflect external services (path-based, not embedded)
- Gitea issues #86-#97 created (M2: AI infra + pipeline; M4: LLM MLOps)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 08:20:44 +00:00
parent faf44c18fc
commit 85367aeaa0
25 changed files with 695 additions and 222 deletions
--- a/README.md
+++ b/README.md
@@ -67,6 +67,53 @@ docs/        architecture, adr, api

 ---

+## AI stack
+
+oO is AI-native: the recommender's job is to **rank**, not to write. An LLM generates candidate tips from the user's context; the bandit picks the best one.
+
+### Three-tier layout
+
+| Tier | Service | Purpose | Where |
+|------|---------|---------|-------|
+| Inference | **Ollama** | Local LLM + embedding; no data leaves the host | `localhost:11434` |
+| Routing | **LiteLLM** | Unified OpenAI-compatible API; model aliases; cloud fallback | `llm.alogins.net` (Agap shared) |
+| Testing | **OpenWebUI** | Prompt iteration, model comparison, manual evals | `ai.alogins.net` (Agap shared) |
+
+### Tip generation pipeline (Phase 2 target)
+
+```
+User signals  ──▶  Context assembler  ──▶  LiteLLM  ──▶  Ollama (local)
+(tasks, calendar,    (ml/features/)         (routing)     or cloud fallback
+ patterns, time)
+                                                ▼
+                                     N typed TipCandidates
+                                     {content, kind, model,
+                                      prompt_version, confidence}
+                                                ▼
+                                    Bandit policy (ml/serving)
+                                    scores + ranks candidates
+                                                ▼
+                                         Best tip shown
+                                                ▼
+                              User reaction (done / snooze / dismiss + dwell)
+                                                ▼
+                              Online bandit update + prompt_version tracking
+```
+
+**Why LiteLLM as gateway:**  All LLM calls use a single `LITELLM_URL` env var. Swapping from qwen2.5 to llama3.2, or routing a fraction to Claude for A/B, is a config change in LiteLLM — zero code change in oO. The model name in `tip_scores` tells you exactly which model produced each tip.
+
+**Why Ollama first:**  Tips contain personal context. Local inference means no user data leaves the host for the inference path. Cloud models (Anthropic, OpenAI) are opt-in fallbacks for evaluation and simulation only, gated behind `ANTHROPIC_API_KEY`.
+
+### Models (planned)
+
+| Alias | Model | Task |
+|-------|-------|------|
+| `tip-generator` | qwen2.5:7b (default) | Generate typed tip candidates from user context |
+| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup |
+| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B |
+
+---
+
 ## Roadmap

 ### Phase 0 — Walking skeleton  *(M0)* ✓ shipped
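
Review note on the AI stack hunk above: the candidate-then-rank flow in the pipeline diagram can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in `ml/serving`. The `TipCandidate` fields mirror the diagram; `parse_candidates`, `pick_best`, the JSON-array shape of the LLM reply, and the confidence-based scorer standing in for the bandit policy are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TipCandidate:
    # Field names follow the diagram in the README's AI stack section.
    content: str
    kind: str
    model: str
    prompt_version: str
    confidence: float


def parse_candidates(raw: str, model: str, prompt_version: str) -> list[TipCandidate]:
    """Turn an LLM reply (assumed to be a JSON array) into typed candidates.

    Malformed items are skipped rather than raising, so one bad candidate
    does not discard the whole batch.
    """
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    candidates = []
    for item in items:
        if not isinstance(item, dict):
            continue
        content = item.get("content")
        kind = item.get("kind")
        confidence = item.get("confidence")
        if not isinstance(content, str) or not content.strip():
            continue
        if not isinstance(kind, str):
            continue
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            continue
        candidates.append(
            TipCandidate(content.strip(), kind, model, prompt_version, float(confidence))
        )
    return candidates


def pick_best(
    candidates: list[TipCandidate],
    score: Callable[[TipCandidate], float],
) -> Optional[TipCandidate]:
    """Return the highest-scoring candidate, or None if there are none.

    `score` is a stand-in for the bandit policy's scoring function.
    """
    return max(candidates, key=score, default=None)


# Hypothetical LLM reply: the second item has empty content and is dropped.
raw = json.dumps([
    {"content": "Reply to Anna", "kind": "task", "confidence": 0.6},
    {"content": "", "kind": "advice", "confidence": 0.9},
    {"content": "Take a short walk", "kind": "advice", "confidence": 0.8},
])
candidates = parse_candidates(raw, model="tip-generator", prompt_version="v1")
best = pick_best(candidates, score=lambda c: c.confidence)
```

Stamping `model` and `prompt_version` on each candidate at parse time, rather than trusting the LLM to echo them, keeps the attribution the README describes for `tip_scores` independent of model output.
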
@@ -102,7 +149,7 @@ Goal: tips are picked, not drawn from a hat — and they arrive at the right mom

 oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit).

-**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.**  Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Grafana, Marimo) is **embedded** via authenticated reverse-proxy, not re-implemented.
+**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.**  Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Airflow) runs as **separate external services** linked from the admin shell; Grafana panels are embedded.

 | Layer | Tool | Why |
 |-------|------|-----|
@@ -111,7 +158,8 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
 | CRUD primitives | **[shadcn/ui](https://ui.shadcn.com)** | Copy-paste Radix components; forms, dialogs, command palette |
 | Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) |
 | Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) |
-| Model registry | **[MLflow UI](https://mlflow.org)** *(embedded)* | Artifact + run browser; don't re-build |
+| Model registry / experiments | **[MLflow](https://mlflow.org)** *(external — `o.alogins.net/mlflow`)* | Experiment tracking, artifact browser, model registry; own basic-auth |
+| Pipeline orchestration | **[Airflow](https://airflow.apache.org)** *(external — `o.alogins.net/airflow`)* | Batch feature + retraining DAGs; own web-auth |
 | Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth |
 | Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link |
 | AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface |
@@ -130,8 +178,8 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
 5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit actions
 6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS
 7. [x] **Feature store browser** — features sent to `ml/serving` per scoring call; diff across time for a user
-8. [x] **Model registry panel** — embed MLflow UI at `/admin/models`; promote / archive via admin context menu (writes audit-logged)
-9. [x] **Experiment dashboard** — LinUCB per-arm stats (pulls, reward mean, α), cohort compare, bandit reset control
+8. [x] **Model registry panel** — `/admin/models` links out to MLflow (`o.alogins.net/mlflow`); experiment tracking and dataset management in MLflow + Airflow
+9. [x] **MLOps hub** — `/admin/experiments` links to MLflow experiments/models and Airflow DAGs/datasets; bandit reset on Users page
 10. [x] **Recommendation log (explainability)** — per served tip: `(user, features, policy, score, feedback, latency)`; `tip_scores` table, 30-day retention
 11. [x] **Reward analytics** — reaction distribution over time; per-policy compare; slice by `hour_of_day`, `priority`, cohort
 12. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap
@@ -142,28 +190,69 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console

 - [ ] Apple OAuth (deferred to M2)

-### Phase 2 — Multi-source profile & trust  *(M2)*
-Goal: oO knows more than tasks, and users can see/control what we know.
- [ ] Integrations: Google Calendar, Apple Health (web import), generic webhook ingress
- [ ] Unified `Profile` model (identity, preferences, contexts, consents)
- [ ] Timing signals (Page Visibility, Idle Detection, coarse location) — opt-in, transparent
- [ ] Advice library + mixing policy (todo vs advice vs ambient)
- [ ] User-facing data dashboard: what's stored, what's computed, export, delete-by-category
- [ ] Cost/usage observability
+### Phase 2 — AI tips + multi-source signals  *(M2)*
+Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multiple signal sources feed a generalized pipeline. Research-intensive milestone.
+
+**AI infrastructure (unblock everything else):**
+- [ ] `ai` compose profile — Ollama + LiteLLM for local dev; env vars `OLLAMA_URL` / `LITELLM_URL` (#86)
+- [ ] AI gateway — wire `ml/serving` to LiteLLM; model aliases `tip-generator` + `embedder` (#87)
+
+**AI tip generation pipeline:**
+- [ ] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`) (#88)
+- [ ] Tip generator endpoint — `POST /generate` in `ml/serving`; LLM → N typed `TipCandidate` objects (#79)
+- [ ] `TipCandidate` shared schema — `{content, kind, source, model, prompt_version, confidence}`; update recommender pipeline (#89)
+- [ ] LLM output validation + retry — JSON schema gate, clarification retry (2×), fallback to task-based (#90)
+- [ ] Prompt versioning — `prompt_version` + `model` columns in `tip_scores`; content-hash invalidation (#91)
+- [ ] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)
+
+**Evaluation & model selection:**
+- [ ] Model benchmark — compare qwen2.5:7b / llama3.2:3b / gemma3:4b via offline sim + LLM judge (#93)
+- [ ] LLM prompt research — persona design, context injection strategies, few-shot examples (#84)
+
+**Pipeline architecture:**
+- [ ] Signal source abstraction — `SignalSource` interface generalizing beyond Todoist (#78)
+- [ ] Generalized recommendation pipeline — candidate → rank → render stages (#80)
+- [ ] Feature registry + user profile builder — centralized features, persistent profiles (#81)
+- [ ] Tip kind system — task, advice, insight, reminder with kind-aware UI + rewards (#82)
+
+**Policy research:**
+- [ ] Next-gen policies — Thompson sampling, neural bandits, hybrid transfer learning (#83)
+
+**Integrations & infra (carried from M1):**
+- [ ] Apple OAuth (#7)
+- [ ] NATS JetStream replacing in-process bus (#21)
+- [ ] Todoist sync via events (#22)
+- [ ] Event schema registry + protobuf CI gate (#54)
+- [ ] Per-user freshness SLAs for features (#61)
+- [ ] CI skeleton (#3), observability (#18), E2E tests (#20)
+
+**Bugs (fix before new features):**
+- [ ] TipFeedback type mismatch (#73)
+- [ ] Todoist token refresh (#74)
+- [ ] Reward fire-and-forget (#75)
+- [ ] Data retention purge (#76)
+- [ ] Port mismatch (#77)

 ### Phase 3 — Native mobile  *(M3)*
 - [ ] iOS app (SwiftUI) with APNs push
 - [ ] Android app (Compose) with FCM push
 - [ ] `notifier` gains APNs + FCM channels, per-device rate limits
 - [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004)
+- [ ] Consolidate MLflow + Airflow behind shared OIDC (SSO for all internal services)
 - [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold

 ### Phase 4 — MLOps at scale  *(M4)*
- [ ] Prefect/Airflow for batch feature materialization + retraining
- [ ] MLflow registry; shadow → A/B → launch pipeline as first-class
+- [x] Airflow + MLflow deployed as external services (`mlops` compose profile); each with own auth
+- [ ] Write first retraining DAG (Airflow) + first MLflow experiment logging from `ml/serving`
+- [ ] Feature-to-prompt pipeline — nightly Airflow DAG materializes context for LLM; cuts inline latency (#94)
+- [ ] Prompt optimization loop — sim A/B → MLflow experiment → human-approved promotion (#95)
+- [ ] LLM fine-tuning — tip reactions as training signal; LoRA on base model; MLflow tracks runs (#96)
+- [ ] Embedding-based task clustering — `nomic-embed-text` for dedup + user pattern features (#97)
+- [ ] Consolidate MLflow + Airflow auth into shared OIDC provider (tracked as M3 issue #85)
+- [ ] Shadow → A/B → launch pipeline as first-class in MLflow
 - [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B
 - [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks
- [ ] Drift monitoring (feature drift, prediction drift, reward drift); model cards per version
+- [ ] Drift monitoring (feature + prediction + reward drift); model cards per LLM version

 ### Phase 5 — Production hardening  *(M5)*
 - [ ] Audit logging, rotation of provider tokens + internal signing keys