diff --git a/CLAUDE.md b/CLAUDE.md index b5d092e..2672621 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -65,7 +65,7 @@ docs/ architecture notes, ADRs, API specs - One PR = one concern. Conventional-commit prefixes (`feat:`, `fix:`, `chore:`, `docs:`, `refactor:`). - ADRs go in `docs/adr/NNNN-title.md` for any decision that constrains future work. - No secrets in repo. Local dev via `.env.local` (gitignored), prod via the server's secret store (Vaultwarden now; k8s secrets later). -- Compose profiles (`core`, `full`) so devs can run a subset without 16 GB of RAM. +- Compose profiles: `core` (api + web + admin), `full` (adds ml-serving), `mlops` (adds MLflow + Airflow), `ai` (adds Ollama + LiteLLM). Mix as needed. ## Definition of done (per feature) @@ -76,15 +76,38 @@ docs/ architecture notes, ADRs, API specs 5. Deployable via `docker compose up` locally. 6. If it touches user data → a deletion path exists and is tested. +## AI stack + +oO generates tips with an LLM and ranks them with a bandit. All LLM calls route through **LiteLLM** at `llm.alogins.net` using model aliases — swapping models is a config change, not a code change. + +| Alias | Model | Used by | +|-------|-------|---------| +| `tip-generator` | qwen2.5:7b (default) | `ml/serving` tip generation | +| `embedder` | nomic-embed-text | task clustering, dedup | +| `judge` | claude-haiku-4-5 (cloud, eval only) | offline sim | + +Env vars: `LITELLM_URL` (default `http://localhost:4000`), `OLLAMA_URL` (default `http://localhost:11434`). + +Start with: `docker compose --profile ai up` (adds Ollama + LiteLLM locally). In prod both are shared Agap services. + +**LLM tip generation pipeline:** +1. `ml/features/context.py` assembles user signals → structured prompt context +2. `POST /generate` in `ml/serving` calls LiteLLM → returns `TipCandidate[]` +3. Bandit policy in `ml/serving` scores + ranks candidates +4. Best candidate returned as tip; reaction closes the online reward loop + ## Current phase -**Phase 0 — Prototype.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`. +**M1 shipped. M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`. + +Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone. ## What NOT to do - Don't copy Todoist's data into our DB. Store the OAuth token + computed features/derivatives we need, fetch raw on demand. -- Don't implement auth by hand. Phase 0 uses **Auth.js** behind an OIDC-shaped boundary (ADR-0004); swap to a dedicated OIDC provider only when mobile ships. -- Don't hardwire a recommender. The "random todo" v0 must live behind the same interface the real ML model will implement (`POST /recommend` → `{tip}`). Swap internals, keep contract. +- Don't implement auth by hand. Auth.js behind an OIDC-shaped boundary (ADR-0004); swap to a dedicated OIDC provider only when mobile ships. +- Don't hardwire a recommender. The contract is `POST /recommend → {tip}`. Swap internals (bandit, LLM, hybrid), keep contract. - Don't replace a policy in one step. New policies deploy shadow-first; promoted only after offline + online agreement with the incumbent (ADR-0002). -- Don't build an admin UI before the user-facing black page is polished. - Don't over-split processes. Extract a service when pressure demands it, not in anticipation (ADR-0003). +- Don't call LLMs directly from application code. All LLM calls go through `ml/serving` (Python) via `LITELLM_URL`. The TS recommender never holds a model name. +- Don't embed MLflow/Airflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `/airflow`, `ai.alogins.net`. diff --git a/README.md b/README.md index 7e54f84..40f70f7 100644 --- a/README.md +++ b/README.md @@ -67,6 +67,53 @@ docs/ architecture, adr, api --- +## AI stack + +oO is AI-native: the recommender's job is to **rank**, not to write. An LLM generates candidate tips from the user's context; the bandit picks the best one. + +### Three-tier layout + +| Tier | Service | Purpose | Where | +|------|---------|---------|-------| +| Inference | **Ollama** | Local LLM + embedding; no data leaves the host | `localhost:11434` | +| Routing | **LiteLLM** | Unified OpenAI-compatible API; model aliases; cloud fallback | `llm.alogins.net` (Agap shared) | +| Testing | **OpenWebUI** | Prompt iteration, model comparison, manual evals | `ai.alogins.net` (Agap shared) | + +### Tip generation pipeline (Phase 2 target) + +``` +User signals ──▶ Context assembler ──▶ LiteLLM ──▶ Ollama (local) +(tasks, calendar, (ml/features/) (routing) or cloud fallback + patterns, time) + ▼ + N typed TipCandidates + {content, kind, model, + prompt_version, confidence} + ▼ + Bandit policy (ml/serving) + scores + ranks candidates + ▼ + Best tip shown + ▼ + User reaction (done / snooze / dismiss + dwell) + ▼ + Online bandit update + prompt_version tracking +``` + +**Why LiteLLM as gateway:** All LLM calls use a single `LITELLM_URL` env var. Swapping from qwen2.5 to llama3.2, or routing a fraction to Claude for A/B, is a config change in LiteLLM — zero code change in oO. The model name in `tip_scores` tells you exactly which model produced each tip. + +**Why Ollama first:** Tips contain personal context. Local inference means no user data leaves the host for the inference path. Cloud models (Anthropic, OpenAI) are opt-in fallbacks for evaluation and simulation only, gated behind `ANTHROPIC_API_KEY`. + +### Models (planned) + +| Alias | Model | Task | +|-------|-------|------| +| `tip-generator` | qwen2.5:7b (default) | Generate typed tip candidates from user context | +| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup | +| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B | + +--- + ## Roadmap ### Phase 0 — Walking skeleton *(M0)* ✓ shipped @@ -102,7 +149,7 @@ Goal: tips are picked, not drawn from a hat — and they arrive at the right mom oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit). -**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.** Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Grafana, Marimo) is **embedded** via authenticated reverse-proxy, not re-implemented. +**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.** Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Airflow) runs as **separate external services** linked from the admin shell; Grafana panels are embedded. | Layer | Tool | Why | |-------|------|-----| @@ -111,7 +158,8 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console | CRUD primitives | **[shadcn/ui](https://ui.shadcn.com)** | Copy-paste Radix components; forms, dialogs, command palette | | Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) | | Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) | -| Model registry | **[MLflow UI](https://mlflow.org)** *(embedded)* | Artifact + run browser; don't re-build | +| Model registry / experiments | **[MLflow](https://mlflow.org)** *(external — `o.alogins.net/mlflow`)* | Experiment tracking, artifact browser, model registry; own basic-auth | +| Pipeline orchestration | **[Airflow](https://airflow.apache.org)** *(external — `o.alogins.net/airflow`)* | Batch feature + retraining DAGs; own web-auth | | Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth | | Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link | | AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface | @@ -130,8 +178,8 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console 5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit actions 6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS 7. [x] **Feature store browser** — features sent to `ml/serving` per scoring call; diff across time for a user -8. [x] **Model registry panel** — embed MLflow UI at `/admin/models`; promote / archive via admin context menu (writes audit-logged) -9. [x] **Experiment dashboard** — LinUCB per-arm stats (pulls, reward mean, α), cohort compare, bandit reset control +8. [x] **Model registry panel** — `/admin/models` links out to MLflow (`mlflow.o.alogins.net`); experiment tracking and dataset management in MLflow + Airflow +9. [x] **MLOps hub** — `/admin/experiments` links to MLflow experiments/models and Airflow DAGs/datasets; bandit reset on Users page 10. [x] **Recommendation log (explainability)** — per served tip: `(user, features, policy, score, feedback, latency)`; `tip_scores` table, 30-day retention 11. [x] **Reward analytics** — reaction distribution over time; per-policy compare; slice by `hour_of_day`, `priority`, cohort 12. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap @@ -142,28 +190,69 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console - [ ] Apple OAuth (deferred to M2) -### Phase 2 — Multi-source profile & trust *(M2)* -Goal: oO knows more than tasks, and users can see/control what we know. -- [ ] Integrations: Google Calendar, Apple Health (web import), generic webhook ingress -- [ ] Unified `Profile` model (identity, preferences, contexts, consents) -- [ ] Timing signals (Page Visibility, Idle Detection, coarse location) — opt-in, transparent -- [ ] Advice library + mixing policy (todo vs advice vs ambient) -- [ ] User-facing data dashboard: what's stored, what's computed, export, delete-by-category -- [ ] Cost/usage observability +### Phase 2 — AI tips + multi-source signals *(M2)* +Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multiple signal sources feed a generalized pipeline. Research-intensive milestone. + +**AI infrastructure (unblock everything else):** +- [ ] `ai` compose profile — Ollama + LiteLLM for local dev; env vars `OLLAMA_URL` / `LITELLM_URL` (#86) +- [ ] AI gateway — wire `ml/serving` to LiteLLM; model aliases `tip-generator` + `embedder` (#87) + +**AI tip generation pipeline:** +- [ ] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`) (#88) +- [ ] Tip generator endpoint — `POST /generate` in `ml/serving`; LLM → N typed `TipCandidate` objects (#79) +- [ ] `TipCandidate` shared schema — `{content, kind, source, model, prompt_version, confidence}`; update recommender pipeline (#89) +- [ ] LLM output validation + retry — JSON schema gate, clarification retry (2×), fallback to task-based (#90) +- [ ] Prompt versioning — `prompt_version` + `model` columns in `tip_scores`; content-hash invalidation (#91) +- [ ] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92) + +**Evaluation & model selection:** +- [ ] Model benchmark — compare qwen2.5:7b / llama3.2:3b / gemma3:4b via offline sim + LLM judge (#93) +- [ ] LLM prompt research — persona design, context injection strategies, few-shot examples (#84) + +**Pipeline architecture:** +- [ ] Signal source abstraction — `SignalSource` interface generalizing beyond Todoist (#78) +- [ ] Generalized recommendation pipeline — candidate → rank → render stages (#80) +- [ ] Feature registry + user profile builder — centralized features, persistent profiles (#81) +- [ ] Tip kind system — task, advice, insight, reminder with kind-aware UI + rewards (#82) + +**Policy research:** +- [ ] Next-gen policies — Thompson sampling, neural bandits, hybrid transfer learning (#83) + +**Integrations & infra (carried from M1):** +- [ ] Apple OAuth (#7) +- [ ] NATS JetStream replacing in-process bus (#21) +- [ ] Todoist sync via events (#22) +- [ ] Event schema registry + protobuf CI gate (#54) +- [ ] Per-user freshness SLAs for features (#61) +- [ ] CI skeleton (#3), observability (#18), E2E tests (#20) + +**Bugs (fix before new features):** +- [ ] TipFeedback type mismatch (#73) +- [ ] Todoist token refresh (#74) +- [ ] Reward fire-and-forget (#75) +- [ ] Data retention purge (#76) +- [ ] Port mismatch (#77) ### Phase 3 — Native mobile *(M3)* - [ ] iOS app (SwiftUI) with APNs push - [ ] Android app (Compose) with FCM push - [ ] `notifier` gains APNs + FCM channels, per-device rate limits - [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004) +- [ ] Consolidate MLflow + Airflow behind shared OIDC (SSO for all internal services) - [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold ### Phase 4 — MLOps at scale *(M4)* -- [ ] Prefect/Airflow for batch feature materialization + retraining -- [ ] MLflow registry; shadow → A/B → launch pipeline as first-class +- [x] Airflow + MLflow deployed as external services (`mlops` compose profile); each with own auth +- [ ] Write first retraining DAG (Airflow) + first MLflow experiment logging from `ml/serving` +- [ ] Feature-to-prompt pipeline — nightly Airflow DAG materializes context for LLM; cuts inline latency (#94) +- [ ] Prompt optimization loop — sim A/B → MLflow experiment → human-approved promotion (#95) +- [ ] LLM fine-tuning — tip reactions as training signal; LoRA on base model; MLflow tracks runs (#96) +- [ ] Embedding-based task clustering — `nomic-embed-text` for dedup + user pattern features (#97) +- [ ] Consolidate MLflow + Airflow auth into shared OIDC provider (tracked as M3 issue #85) +- [ ] Shadow → A/B → launch pipeline as first-class in MLflow - [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B - [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks -- [ ] Drift monitoring (feature drift, prediction drift, reward drift); model cards per version +- [ ] Drift monitoring (feature + prediction + reward drift); model cards per LLM version ### Phase 5 — Production hardening *(M5)* - [ ] Audit logging, rotation of provider tokens + internal signing keys diff --git a/apps/admin/next.config.ts b/apps/admin/next.config.ts index 6e87389..f19cb8f 100644 --- a/apps/admin/next.config.ts +++ b/apps/admin/next.config.ts @@ -1,6 +1,10 @@ import type { NextConfig } from 'next'; +import path from 'node:path'; const nextConfig: NextConfig = { + output: 'standalone', + outputFileTracingRoot: path.join(__dirname, '../../'), + basePath: '/admin', async rewrites() { return [ { diff --git a/apps/admin/src/app/docs/[category]/[slug]/page.tsx b/apps/admin/src/app/docs/[category]/[slug]/page.tsx index b9f35da..2ce91da 100644 --- a/apps/admin/src/app/docs/[category]/[slug]/page.tsx +++ b/apps/admin/src/app/docs/[category]/[slug]/page.tsx @@ -17,14 +17,15 @@ function isDocCategory(value: string): value is DocCategory { export default async function DocDetailPage({ params, }: { - params: { category: string; slug: string }; + params: Promise<{ category: string; slug: string }>; }) { - if (!isDocCategory(params.category)) notFound(); + const { category, slug } = await params; + if (!isDocCategory(category)) notFound(); - const doc = await getDoc(params.category, params.slug); + const doc = await getDoc(category, slug); if (!doc) notFound(); - const categoryLabel = CATEGORY_LABELS[params.category]; + const categoryLabel = CATEGORY_LABELS[category]; return ( diff --git a/apps/admin/src/app/experiments/page.tsx b/apps/admin/src/app/experiments/page.tsx index bb8bb26..6134991 100644 --- a/apps/admin/src/app/experiments/page.tsx +++ b/apps/admin/src/app/experiments/page.tsx @@ -1,124 +1,89 @@ -'use client'; - -import { useEffect, useState } from 'react'; import { AdminShell } from '@/components/AdminShell'; -import { resetBandit } from '@/lib/api'; -interface BanditStats { - user_id: string; - pulls: number; - reward_count: number; - cumulative_reward: number; - estimated_mean_reward: number; - theta: number[]; - last_updated: string | null; -} - -const FEATURE_LABELS = ['hour_sin', 'hour_cos', 'is_overdue', 'task_age', 'priority']; +const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow'; +const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow'; export default function ExperimentsPage() { - const [userId, setUserId] = useState(''); - const [stats, setStats] = useState(null); - const [loading, setLoading] = useState(false); - const [resetting, setResetting] = useState(false); - const [error, setError] = useState(''); - const [resetMsg, setResetMsg] = useState(''); - - const fetchStats = async () => { - if (!userId.trim()) return; - setLoading(true); - setError(''); - try { - const res = await fetch(`/api/ml/stats/${encodeURIComponent(userId.trim())}`, { credentials: 'include' }); - if (!res.ok) throw new Error(res.statusText); - setStats(await res.json()); - } catch (e: any) { - setError(e.message); - } finally { - setLoading(false); - } - }; - - const handleReset = async () => { - if (!userId.trim()) return; - if (!confirm(`Reset LinUCB state for user ${userId}?`)) return; - setResetting(true); - try { - await resetBandit(userId.trim()); - setResetMsg('Bandit state reset.'); - setStats(null); - } catch (e: any) { - setError(e.message); - } finally { - setResetting(false); - } - }; - return (
-

Experiment dashboard

-

LinUCB per-user bandit stats pulled from ml/serving.

+

MLOps

+

+ Experiment tracking, dataset management, and pipeline orchestration live in dedicated external services. + Each has its own auth — see{' '} + MLOps runbook + {' '}for credentials and first-time setup. +

-
- setUserId(e.target.value)} - onKeyDown={(e) => e.key === 'Enter' && fetchStats()} - placeholder="User ID" - className="bg-gray-900 border border-gray-700 rounded px-3 py-1.5 text-sm text-gray-300 w-80" - /> - - {stats && ( - - )} -
- - {error &&

{error}

} - {resetMsg &&

{resetMsg}

} - {loading &&

Loading…

} - - {stats && ( -
- - - - +
+

Experiment tracking

+
+ +
- )} +
- {stats?.theta && ( -
-

θ (learned weight vector)

-
- {stats.theta.map((v, i) => ( -
-
{FEATURE_LABELS[i] ?? `feat_${i}`}
-
0 ? 'text-green-400' : v < 0 ? 'text-red-400' : 'text-gray-400'}`}> - {v.toFixed(4)} -
-
- ))} -
- {stats.last_updated && ( -

Last updated: {stats.last_updated}

- )} +
+

Pipeline orchestration

+
+ +
- )} +
+ +
+

Bandit state ops

+

+ Per-user LinUCB reset is available on the{' '} + Users page + {' '}→ user detail view. +

+
); } -function StatCard({ label, value }: { label: string; value: string | number }) { +function ExternalCard({ title, description, href, label }: { + title: string; + description: string; + href: string; + label: string; +}) { return ( -
-
{label}
-
{value}
+
+
+

{title}

+

{description}

+
+ + {label} +
); } diff --git a/apps/admin/src/app/login/page.tsx b/apps/admin/src/app/login/page.tsx index 6c4afaa..1363d98 100644 --- a/apps/admin/src/app/login/page.tsx +++ b/apps/admin/src/app/login/page.tsx @@ -5,7 +5,7 @@ export default function LoginPage() {

oO Admin

Sign in via the main app first, then return here.

Sign in with Google diff --git a/apps/admin/src/app/models/page.tsx b/apps/admin/src/app/models/page.tsx index 8301fdc..012c768 100644 --- a/apps/admin/src/app/models/page.tsx +++ b/apps/admin/src/app/models/page.tsx @@ -1,30 +1,53 @@ import { AdminShell } from '@/components/AdminShell'; -export default function ModelsPage() { - const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? 'http://localhost:5000'; +const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow'; +export default function ModelsPage() { return ( -
- -

- MLflow is embedded below when running under the full compose profile. - Promote or archive model versions via the MLflow UI; each action writes to the audit log automatically. +

+

Model registry

+

+ Model lifecycle (runs, versions, promotions, artifacts) is managed in MLflow. + Auth is separate — log in with your MLflow credentials.

-
-