feat: MLOps external services, AI stack planning, admin MLOps hub

Infrastructure:
- Add `mlops` compose profile: MLflow (basic-auth, /mlflow path) + Airflow (LocalExecutor, /airflow path) + airflow-db
- infra/mlflow/basic_auth.ini for MLflow auth config
- Caddy routes /mlflow* and /airflow* inside existing o.alogins.net block (see agap_git)
- Dockerfile.admin: NEXT_PUBLIC_MLFLOW_URL / NEXT_PUBLIC_AIRFLOW_URL build args (default /mlflow, /airflow)

Admin panel:
- /admin/models: replace MLflow iframe with external link cards
- /admin/experiments: replace LinUCB stats with MLOps hub (links to MLflow experiments/models + Airflow DAGs/datasets)
- AdminShell: external nav links for MLflow ↗ and Airflow ↗ under MLOps section

Docs & planning:
- README: new AI stack section (Ollama/LiteLLM/OpenWebUI three-tier, tip generation pipeline, model aliases)
- README: Phase 2 expanded with AI infra issues (#86-#93) and granular pipeline breakdown
- README: Phase 4 expanded with LLM MLOps items (#94-#97)
- CLAUDE.md: AI stack section, updated current phase (M1 shipped / M2 in progress), compose profiles, updated What NOT to do
- docs/architecture/overview.md: AI stack section, updated decision flow diagram for Phase 2 LLM pipeline
- ADR-0006: updated to reflect external services (path-based, not embedded)
- Gitea issues #86-#97 created (M2: AI infra + pipeline; M4: LLM MLOps)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 08:20:44 +00:00
parent faf44c18fc
commit 85367aeaa0
25 changed files with 695 additions and 222 deletions

@@ -65,7 +65,7 @@ docs/ architecture notes, ADRs, API specs
- One PR = one concern. Conventional-commit prefixes (`feat:`, `fix:`, `chore:`, `docs:`, `refactor:`). - One PR = one concern. Conventional-commit prefixes (`feat:`, `fix:`, `chore:`, `docs:`, `refactor:`).
- ADRs go in `docs/adr/NNNN-title.md` for any decision that constrains future work. - ADRs go in `docs/adr/NNNN-title.md` for any decision that constrains future work.
- No secrets in repo. Local dev via `.env.local` (gitignored), prod via the server's secret store (Vaultwarden now; k8s secrets later). - No secrets in repo. Local dev via `.env.local` (gitignored), prod via the server's secret store (Vaultwarden now; k8s secrets later).
- Compose profiles (`core`, `full`) so devs can run a subset without 16 GB of RAM. - Compose profiles: `core` (api + web + admin), `full` (adds ml-serving), `mlops` (adds MLflow + Airflow), `ai` (adds Ollama + LiteLLM). Mix as needed.
## Definition of done (per feature) ## Definition of done (per feature)
@@ -76,15 +76,38 @@ docs/ architecture notes, ADRs, API specs
5. Deployable via `docker compose up` locally. 5. Deployable via `docker compose up` locally.
6. If it touches user data → a deletion path exists and is tested. 6. If it touches user data → a deletion path exists and is tested.
## AI stack
oO generates tips with an LLM and ranks them with a bandit. All LLM calls route through **LiteLLM** at `llm.alogins.net` using model aliases — swapping models is a config change, not a code change.
| Alias | Model | Used by |
|-------|-------|---------|
| `tip-generator` | qwen2.5:7b (default) | `ml/serving` tip generation |
| `embedder` | nomic-embed-text | task clustering, dedup |
| `judge` | claude-haiku-4-5 (cloud, eval only) | offline sim |
Env vars: `LITELLM_URL` (default `http://localhost:4000`), `OLLAMA_URL` (default `http://localhost:11434`).
Start with: `docker compose --profile ai up` (adds Ollama + LiteLLM locally). In prod both are shared Agap services.
**LLM tip generation pipeline:**
1. `ml/features/context.py` assembles user signals → structured prompt context
2. `POST /generate` in `ml/serving` calls LiteLLM → returns `TipCandidate[]`
3. Bandit policy in `ml/serving` scores + ranks candidates
4. Best candidate returned as tip; reaction closes the online reward loop
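
Steps 2 to 4 can be sketched in a few lines. The real logic lives in `ml/serving` (Python); the TypeScript below is only an illustration. The `TipCandidate` fields follow the shape named in the README roadmap, while `Scorer`, `rankCandidates`, and the confidence-based scorer are hypothetical stand-ins for the bandit policy.

```typescript
// Field names follow the TipCandidate shape listed in the roadmap; the kinds are the four named there.
type TipKind = 'task' | 'advice' | 'insight' | 'reminder';

interface TipCandidate {
  content: string;
  kind: TipKind;
  model: string;
  prompt_version: string;
  confidence: number;
}

// Hypothetical stand-in for the bandit policy: any function mapping a candidate to a score.
type Scorer = (candidate: TipCandidate) => number;

// Score every candidate and return a new array ordered best-first.
function rankCandidates(candidates: TipCandidate[], score: Scorer): TipCandidate[] {
  return candidates
    .map((candidate) => ({ candidate, score: score(candidate) }))
    .sort((a, b) => b.score - a.score)
    .map(({ candidate }) => candidate);
}

const candidates: TipCandidate[] = [
  { content: 'Reply to the overdue invoice email', kind: 'task', model: 'tip-generator', prompt_version: 'v1', confidence: 0.62 },
  { content: 'Take a five minute walk', kind: 'advice', model: 'tip-generator', prompt_version: 'v1', confidence: 0.81 },
];

// Toy scorer for the demo only: trusts the model's own confidence.
const byConfidence: Scorer = (candidate) => candidate.confidence;
const bestTip = rankCandidates(candidates, byConfidence)[0];
```

Keeping the scorer as a plain function argument mirrors the "swap internals, keep contract" rule: the ranking step does not care whether the score comes from LinUCB, an LLM judge, or a hybrid.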
## Current phase ## Current phase
**Phase 0 — Prototype.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`. **M1 shipped. M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`.
Active work: AI tip generation pipeline — issues #86-#93 in M2 milestone.
## What NOT to do ## What NOT to do
- Don't copy Todoist's data into our DB. Store the OAuth token + computed features/derivatives we need, fetch raw on demand. - Don't copy Todoist's data into our DB. Store the OAuth token + computed features/derivatives we need, fetch raw on demand.
- Don't implement auth by hand. Phase 0 uses **Auth.js** behind an OIDC-shaped boundary (ADR-0004); swap to a dedicated OIDC provider only when mobile ships. - Don't implement auth by hand. Auth.js behind an OIDC-shaped boundary (ADR-0004); swap to a dedicated OIDC provider only when mobile ships.
- Don't hardwire a recommender. The "random todo" v0 must live behind the same interface the real ML model will implement (`POST /recommend` → `{tip}`). Swap internals, keep contract. - Don't hardwire a recommender. The contract is `POST /recommend → {tip}`. Swap internals (bandit, LLM, hybrid), keep contract.
- Don't replace a policy in one step. New policies deploy shadow-first; promoted only after offline + online agreement with the incumbent (ADR-0002). - Don't replace a policy in one step. New policies deploy shadow-first; promoted only after offline + online agreement with the incumbent (ADR-0002).
- Don't build an admin UI before the user-facing black page is polished.
- Don't over-split processes. Extract a service when pressure demands it, not in anticipation (ADR-0003). - Don't over-split processes. Extract a service when pressure demands it, not in anticipation (ADR-0003).
- Don't call LLMs directly from application code. All LLM calls go through `ml/serving` (Python) via `LITELLM_URL`. The TS recommender never holds a model name.
- Don't embed MLflow/Airflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `/airflow`, `ai.alogins.net`.

119
README.md
@@ -67,6 +67,53 @@ docs/ architecture, adr, api
--- ---
## AI stack
oO is AI-native: the recommender's job is to **rank**, not to write. An LLM generates candidate tips from the user's context; the bandit picks the best one.
### Three-tier layout
| Tier | Service | Purpose | Where |
|------|---------|---------|-------|
| Inference | **Ollama** | Local LLM + embedding; no data leaves the host | `localhost:11434` |
| Routing | **LiteLLM** | Unified OpenAI-compatible API; model aliases; cloud fallback | `llm.alogins.net` (Agap shared) |
| Testing | **OpenWebUI** | Prompt iteration, model comparison, manual evals | `ai.alogins.net` (Agap shared) |
### Tip generation pipeline (Phase 2 target)
```
User signals       ──▶ Context assembler ──▶ LiteLLM   ──▶ Ollama (local)
(tasks, calendar,      (ml/features/)        (routing)     or cloud fallback
 patterns, time)
                                                 │
                                                 ▼
                                     N typed TipCandidates
                                     {content, kind, model,
                                      prompt_version, confidence}
                                                 │
                                                 ▼
                                     Bandit policy (ml/serving)
                                     scores + ranks candidates
                                                 │
                                                 ▼
                                     Best tip shown
                                                 │
                                                 ▼
                                     User reaction (done / snooze / dismiss + dwell)
                                                 │
                                                 ▼
                                     Online bandit update + prompt_version tracking
```
**Why LiteLLM as gateway:** All LLM calls use a single `LITELLM_URL` env var. Swapping from qwen2.5 to llama3.2, or routing a fraction to Claude for A/B, is a config change in LiteLLM — zero code change in oO. The model name in `tip_scores` tells you exactly which model produced each tip.
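
To make the "config change, not code change" point concrete, here is a minimal sketch of how a caller could build a gateway request that names only an alias. In oO these calls belong in `ml/serving` (Python), so the TypeScript is illustrative only. `buildChatRequest` and `ChatRequest` are hypothetical names, and the `/v1/chat/completions` path assumes LiteLLM's OpenAI-compatible proxy route.

```typescript
// Aliases from the models table; the caller never names a concrete model.
type ModelAlias = 'tip-generator' | 'embedder' | 'judge';

interface ChatRequest {
  url: string;
  body: {
    model: ModelAlias;
    messages: { role: 'system' | 'user'; content: string }[];
  };
}

// Build (but do not send) an OpenAI-style chat completion request for the gateway.
// `litellmUrl` would be read from the LITELLM_URL env var by the caller.
function buildChatRequest(litellmUrl: string, alias: ModelAlias, prompt: string): ChatRequest {
  const base = litellmUrl.replace(/\/+$/, ''); // tolerate a trailing slash
  return {
    url: `${base}/v1/chat/completions`,
    body: { model: alias, messages: [{ role: 'user', content: prompt }] },
  };
}

const request = buildChatRequest('http://localhost:4000/', 'tip-generator', 'Suggest one tip.');
```

Pointing `tip-generator` at a different model then happens entirely in the LiteLLM config; the request built here stays byte-for-byte the same.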
**Why Ollama first:** Tips contain personal context. Local inference means no user data leaves the host for the inference path. Cloud models (Anthropic, OpenAI) are opt-in fallbacks for evaluation and simulation only, gated behind `ANTHROPIC_API_KEY`.
### Models (planned)
| Alias | Model | Task |
|-------|-------|------|
| `tip-generator` | qwen2.5:7b (default) | Generate typed tip candidates from user context |
| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup |
| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B |
---
## Roadmap ## Roadmap
### Phase 0 — Walking skeleton *(M0)* ✓ shipped ### Phase 0 — Walking skeleton *(M0)* ✓ shipped
@@ -102,7 +149,7 @@ Goal: tips are picked, not drawn from a hat — and they arrive at the right mom
oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit). oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit).
**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.** Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Grafana, Marimo) is **embedded** via authenticated reverse-proxy, not re-implemented. **Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.** Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Airflow) runs as **separate external services** linked from the admin shell; Grafana panels are embedded.
| Layer | Tool | Why | | Layer | Tool | Why |
|-------|------|-----| |-------|------|-----|
@@ -111,7 +158,8 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
| CRUD primitives | **[shadcn/ui](https://ui.shadcn.com)** | Copy-paste Radix components; forms, dialogs, command palette | | CRUD primitives | **[shadcn/ui](https://ui.shadcn.com)** | Copy-paste Radix components; forms, dialogs, command palette |
| Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) | | Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) |
| Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) | | Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) |
| Model registry | **[MLflow UI](https://mlflow.org)** *(embedded)* | Artifact + run browser; don't re-build | | Model registry / experiments | **[MLflow](https://mlflow.org)** *(external — `o.alogins.net/mlflow`)* | Experiment tracking, artifact browser, model registry; own basic-auth |
| Pipeline orchestration | **[Airflow](https://airflow.apache.org)** *(external — `o.alogins.net/airflow`)* | Batch feature + retraining DAGs; own web-auth |
| Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth | | Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth |
| Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link | | Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link |
| AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface | | AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface |
@@ -130,8 +178,8 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit actions 5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit actions
6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS 6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS
7. [x] **Feature store browser** — features sent to `ml/serving` per scoring call; diff across time for a user 7. [x] **Feature store browser** — features sent to `ml/serving` per scoring call; diff across time for a user
8. [x] **Model registry panel** — embed MLflow UI at `/admin/models`; promote / archive via admin context menu (writes audit-logged) 8. [x] **Model registry panel** — `/admin/models` links out to MLflow (`mlflow.o.alogins.net`); experiment tracking and dataset management in MLflow + Airflow
9. [x] **Experiment dashboard** — LinUCB per-arm stats (pulls, reward mean, α), cohort compare, bandit reset control 9. [x] **MLOps hub** — `/admin/experiments` links to MLflow experiments/models and Airflow DAGs/datasets; bandit reset on Users page
10. [x] **Recommendation log (explainability)** — per served tip: `(user, features, policy, score, feedback, latency)`; `tip_scores` table, 30-day retention 10. [x] **Recommendation log (explainability)** — per served tip: `(user, features, policy, score, feedback, latency)`; `tip_scores` table, 30-day retention
11. [x] **Reward analytics** — reaction distribution over time; per-policy compare; slice by `hour_of_day`, `priority`, cohort 11. [x] **Reward analytics** — reaction distribution over time; per-policy compare; slice by `hour_of_day`, `priority`, cohort
12. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap 12. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap
@@ -142,28 +190,69 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
- [ ] Apple OAuth (deferred to M2) - [ ] Apple OAuth (deferred to M2)
### Phase 2 — Multi-source profile & trust *(M2)* ### Phase 2 — AI tips + multi-source signals *(M2)*
Goal: oO knows more than tasks, and users can see/control what we know. Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multiple signal sources feed a generalized pipeline. Research-intensive milestone.
- [ ] Integrations: Google Calendar, Apple Health (web import), generic webhook ingress
- [ ] Unified `Profile` model (identity, preferences, contexts, consents) **AI infrastructure (unblock everything else):**
- [ ] Timing signals (Page Visibility, Idle Detection, coarse location) — opt-in, transparent - [ ] `ai` compose profile — Ollama + LiteLLM for local dev; env vars `OLLAMA_URL` / `LITELLM_URL` (#86)
- [ ] Advice library + mixing policy (todo vs advice vs ambient) - [ ] AI gateway — wire `ml/serving` to LiteLLM; model aliases `tip-generator` + `embedder` (#87)
- [ ] User-facing data dashboard: what's stored, what's computed, export, delete-by-category
- [ ] Cost/usage observability **AI tip generation pipeline:**
- [ ] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`) (#88)
- [ ] Tip generator endpoint — `POST /generate` in `ml/serving`; LLM → N typed `TipCandidate` objects (#79)
- [ ] `TipCandidate` shared schema — `{content, kind, source, model, prompt_version, confidence}`; update recommender pipeline (#89)
- [ ] LLM output validation + retry — JSON schema gate, clarification retry (2×), fallback to task-based (#90)
- [ ] Prompt versioning — `prompt_version` + `model` columns in `tip_scores`; content-hash invalidation (#91)
- [ ] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)
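
The validation + retry item (#90) could look roughly like the sketch below. It is an illustration in TypeScript, not the planned Python implementation. Everything named here (`parseCandidates`, `generateWithRetry`, `CallModel`, the 0 to 1 confidence range) is an assumption, and "2×" is read as two retries after the first attempt.

```typescript
interface TipCandidate {
  content: string;
  kind: string;
  confidence: number;
}

// Structural check for one candidate; the 0..1 confidence range is an assumption.
function isCandidate(value: unknown): value is TipCandidate {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.content === 'string' && v.content.length > 0 &&
    typeof v.kind === 'string' &&
    typeof v.confidence === 'number' && v.confidence >= 0 && v.confidence <= 1
  );
}

// Schema gate: accept only a non-empty JSON array of well-formed candidates.
function parseCandidates(raw: string): TipCandidate[] | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!Array.isArray(data) || data.length === 0) return null;
  return data.every(isCandidate) ? (data as TipCandidate[]) : null;
}

// Hypothetical model call: prompt in, raw text out.
type CallModel = (prompt: string) => Promise<string>;

// Try once, retry with a clarifying suffix up to `maxRetries` times, then fall back.
async function generateWithRetry(
  callModel: CallModel,
  prompt: string,
  fallback: () => TipCandidate[],
  maxRetries = 2,
): Promise<TipCandidate[]> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const parsed = parseCandidates(await callModel(currentPrompt));
    if (parsed) return parsed;
    currentPrompt = `${prompt}\n\nReturn only a JSON array of {content, kind, confidence}.`;
  }
  return fallback(); // e.g. task-based tips
}
```

The gate is kept separate from the retry loop so it can be unit-tested without a model, and so the same check can guard cached or replayed outputs.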
**Evaluation & model selection:**
- [ ] Model benchmark — compare qwen2.5:7b / llama3.2:3b / gemma3:4b via offline sim + LLM judge (#93)
- [ ] LLM prompt research — persona design, context injection strategies, few-shot examples (#84)
**Pipeline architecture:**
- [ ] Signal source abstraction — `SignalSource` interface generalizing beyond Todoist (#78)
- [ ] Generalized recommendation pipeline — candidate → rank → render stages (#80)
- [ ] Feature registry + user profile builder — centralized features, persistent profiles (#81)
- [ ] Tip kind system — task, advice, insight, reminder with kind-aware UI + rewards (#82)
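
As a rough picture of what the `SignalSource` abstraction (#78) might mean, the sketch below defines one possible interface and a collector that merges several sources. The commit does not specify the interface, so every member name (`Signal`, `fetchSignals`, `collectSignals`) is hypothetical.

```typescript
// Hypothetical normalized signal; the real schema is not defined in this commit.
interface Signal {
  source: string; // e.g. 'todoist', 'calendar'
  kind: string; // e.g. 'task_overdue'
  observedAt: string; // ISO-8601 timestamp
  payload: Record<string, unknown>;
}

// One implementation per integration; the pipeline only sees this interface.
interface SignalSource {
  readonly name: string;
  fetchSignals(userId: string): Promise<Signal[]>;
}

// Merge signals from all sources. A failing source contributes nothing,
// so one provider outage does not block tip generation.
async function collectSignals(sources: SignalSource[], userId: string): Promise<Signal[]> {
  const batches = await Promise.all(
    sources.map((source) => source.fetchSignals(userId).catch(() => [] as Signal[])),
  );
  return batches.reduce<Signal[]>((all, batch) => all.concat(batch), []);
}

// Two in-memory fakes for demonstration.
const fakeTodoist: SignalSource = {
  name: 'todoist',
  fetchSignals: async () => [
    { source: 'todoist', kind: 'task_overdue', observedAt: '2026-04-17T08:00:00Z', payload: {} },
  ],
};
const brokenCalendar: SignalSource = {
  name: 'calendar',
  fetchSignals: async () => {
    throw new Error('provider unavailable');
  },
};
```

Fetching on demand through the interface also fits the rule about not copying Todoist data into the database.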
**Policy research:**
- [ ] Next-gen policies — Thompson sampling, neural bandits, hybrid transfer learning (#83)
**Integrations & infra (carried from M1):**
- [ ] Apple OAuth (#7)
- [ ] NATS JetStream replacing in-process bus (#21)
- [ ] Todoist sync via events (#22)
- [ ] Event schema registry + protobuf CI gate (#54)
- [ ] Per-user freshness SLAs for features (#61)
- [ ] CI skeleton (#3), observability (#18), E2E tests (#20)
**Bugs (fix before new features):**
- [ ] TipFeedback type mismatch (#73)
- [ ] Todoist token refresh (#74)
- [ ] Reward fire-and-forget (#75)
- [ ] Data retention purge (#76)
- [ ] Port mismatch (#77)
### Phase 3 — Native mobile *(M3)* ### Phase 3 — Native mobile *(M3)*
- [ ] iOS app (SwiftUI) with APNs push - [ ] iOS app (SwiftUI) with APNs push
- [ ] Android app (Compose) with FCM push - [ ] Android app (Compose) with FCM push
- [ ] `notifier` gains APNs + FCM channels, per-device rate limits - [ ] `notifier` gains APNs + FCM channels, per-device rate limits
- [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004) - [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004)
- [ ] Consolidate MLflow + Airflow behind shared OIDC (SSO for all internal services)
- [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold - [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold
### Phase 4 — MLOps at scale *(M4)* ### Phase 4 — MLOps at scale *(M4)*
- [ ] Prefect/Airflow for batch feature materialization + retraining - [x] Airflow + MLflow deployed as external services (`mlops` compose profile); each with own auth
- [ ] MLflow registry; shadow → A/B → launch pipeline as first-class - [ ] Write first retraining DAG (Airflow) + first MLflow experiment logging from `ml/serving`
- [ ] Feature-to-prompt pipeline — nightly Airflow DAG materializes context for LLM; cuts inline latency (#94)
- [ ] Prompt optimization loop — sim A/B → MLflow experiment → human-approved promotion (#95)
- [ ] LLM fine-tuning — tip reactions as training signal; LoRA on base model; MLflow tracks runs (#96)
- [ ] Embedding-based task clustering — `nomic-embed-text` for dedup + user pattern features (#97)
- [ ] Consolidate MLflow + Airflow auth into shared OIDC provider (tracked as M3 issue #85)
- [ ] Shadow → A/B → launch pipeline as first-class in MLflow
- [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B - [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B
- [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks - [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks
- [ ] Drift monitoring (feature drift, prediction drift, reward drift); model cards per version - [ ] Drift monitoring (feature + prediction + reward drift); model cards per LLM version
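
"Deterministic assignment" in the online experiments item usually means hashing a stable key so a user always lands in the same arm without storing the assignment. A minimal sketch, assuming an FNV-1a hash over the experiment name and user ID; the function names and the key format are illustrative, not part of the codebase.

```typescript
// FNV-1a (32-bit) over UTF-16 code units: small, dependency-free, stable across processes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Same (experiment, user) pair always maps to the same arm.
// Salting with the experiment name decorrelates assignments across experiments.
function assignArm(userId: string, experiment: string, arms: string[]): string {
  if (arms.length === 0) throw new Error('at least one arm is required');
  return arms[fnv1a(`${experiment}:${userId}`) % arms.length];
}

const arms = ['incumbent', 'shadow-llm'];
const first = assignArm('user-42', 'prompt-v2-rollout', arms);
const second = assignArm('user-42', 'prompt-v2-rollout', arms);
```

A fixed-split A/B uses the hash directly as above; a bandit policy can reuse the same hash as a per-user random seed so its exploration stays reproducible.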
### Phase 5 — Production hardening *(M5)* ### Phase 5 — Production hardening *(M5)*
- [ ] Audit logging, rotation of provider tokens + internal signing keys - [ ] Audit logging, rotation of provider tokens + internal signing keys

@@ -1,6 +1,10 @@
import type { NextConfig } from 'next'; import type { NextConfig } from 'next';
import path from 'node:path';
const nextConfig: NextConfig = { const nextConfig: NextConfig = {
output: 'standalone',
outputFileTracingRoot: path.join(__dirname, '../../'),
basePath: '/admin',
async rewrites() { async rewrites() {
return [ return [
{ {

@@ -17,14 +17,15 @@ function isDocCategory(value: string): value is DocCategory {
export default async function DocDetailPage({ export default async function DocDetailPage({
params, params,
}: { }: {
params: { category: string; slug: string }; params: Promise<{ category: string; slug: string }>;
}) { }) {
if (!isDocCategory(params.category)) notFound(); const { category, slug } = await params;
if (!isDocCategory(category)) notFound();
const doc = await getDoc(params.category, params.slug); const doc = await getDoc(category, slug);
if (!doc) notFound(); if (!doc) notFound();
const categoryLabel = CATEGORY_LABELS[params.category]; const categoryLabel = CATEGORY_LABELS[category];
return ( return (
<AdminShell> <AdminShell>

@@ -1,124 +1,89 @@
'use client';
import { useEffect, useState } from 'react';
import { AdminShell } from '@/components/AdminShell'; import { AdminShell } from '@/components/AdminShell';
import { resetBandit } from '@/lib/api';
interface BanditStats { const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
user_id: string; const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
pulls: number;
reward_count: number;
cumulative_reward: number;
estimated_mean_reward: number;
theta: number[];
last_updated: string | null;
}
const FEATURE_LABELS = ['hour_sin', 'hour_cos', 'is_overdue', 'task_age', 'priority'];
export default function ExperimentsPage() { export default function ExperimentsPage() {
const [userId, setUserId] = useState('');
const [stats, setStats] = useState<BanditStats | null>(null);
const [loading, setLoading] = useState(false);
const [resetting, setResetting] = useState(false);
const [error, setError] = useState('');
const [resetMsg, setResetMsg] = useState('');
const fetchStats = async () => {
if (!userId.trim()) return;
setLoading(true);
setError('');
try {
const res = await fetch(`/api/ml/stats/${encodeURIComponent(userId.trim())}`, { credentials: 'include' });
if (!res.ok) throw new Error(res.statusText);
setStats(await res.json());
} catch (e: any) {
setError(e.message);
} finally {
setLoading(false);
}
};
const handleReset = async () => {
if (!userId.trim()) return;
if (!confirm(`Reset LinUCB state for user ${userId}?`)) return;
setResetting(true);
try {
await resetBandit(userId.trim());
setResetMsg('Bandit state reset.');
setStats(null);
} catch (e: any) {
setError(e.message);
} finally {
setResetting(false);
}
};
return ( return (
<AdminShell> <AdminShell>
<div className="space-y-6"> <div className="space-y-6">
<h1 className="text-xl font-semibold">Experiment dashboard</h1> <h1 className="text-xl font-semibold">MLOps</h1>
<p className="text-sm text-gray-500">LinUCB per-user bandit stats pulled from ml/serving.</p> <p className="text-sm text-gray-500">
Experiment tracking, dataset management, and pipeline orchestration live in dedicated external services.
Each has its own auth; see{' '}
<a href="/admin/docs/ops/mlops" className="text-indigo-400 hover:underline">MLOps runbook</a>
{' '}for credentials and first-time setup.
</p>
<div className="flex gap-2"> <section className="space-y-3">
<input <h2 className="text-sm font-semibold text-gray-400 uppercase tracking-wider">Experiment tracking</h2>
value={userId} <div className="grid gap-3 md:grid-cols-2">
onChange={(e) => setUserId(e.target.value)} <ExternalCard
onKeyDown={(e) => e.key === 'Enter' && fetchStats()} title="Experiments"
placeholder="User ID" description="Training runs · metrics · parameter sweeps · run comparison"
className="bg-gray-900 border border-gray-700 rounded px-3 py-1.5 text-sm text-gray-300 w-80" href={`${mlflowUrl}/#/experiments`}
label="Open in MLflow ↗"
/>
<ExternalCard
title="Registered models"
description="Model versions · stage promotion (Staging → Production) · artifact browser"
href={`${mlflowUrl}/#/models`}
label="Open in MLflow ↗"
/> />
<button onClick={fetchStats} className="bg-indigo-600 hover:bg-indigo-500 text-white rounded px-4 py-1.5 text-sm">
Load
</button>
{stats && (
<button onClick={handleReset} disabled={resetting} className="bg-red-800 hover:bg-red-700 text-white rounded px-4 py-1.5 text-sm disabled:opacity-50">
Reset bandit
</button>
)}
</div> </div>
</section>
{error && <p className="text-red-400 text-sm">{error}</p>} <section className="space-y-3">
{resetMsg && <p className="text-green-400 text-sm">{resetMsg}</p>} <h2 className="text-sm font-semibold text-gray-400 uppercase tracking-wider">Pipeline orchestration</h2>
{loading && <p className="text-gray-500 text-sm">Loading</p>} <div className="grid gap-3 md:grid-cols-2">
<ExternalCard
title="DAGs"
description="Batch feature materialization · retraining pipelines · data quality jobs"
href={`${airflowUrl}/dags`}
label="Open in Airflow ↗"
/>
<ExternalCard
title="Dataset lineage"
description="Pipeline runs · dataset inputs/outputs · data versioning"
href={`${airflowUrl}/datasets`}
label="Open in Airflow ↗"
/>
</div>
</section>
{stats && ( <section className="space-y-2 pt-2 border-t border-gray-800">
<div className="grid grid-cols-2 gap-4 md:grid-cols-4"> <h2 className="text-sm font-semibold text-gray-400 uppercase tracking-wider">Bandit state ops</h2>
<StatCard label="Pulls" value={stats.pulls} /> <p className="text-xs text-gray-500">
<StatCard label="Reward samples" value={stats.reward_count} /> Per-user LinUCB reset is available on the{' '}
<StatCard label="Cumulative reward" value={stats.cumulative_reward.toFixed(2)} /> <a href="/admin/users" className="text-indigo-400 hover:underline">Users page</a>
<StatCard label="Mean reward" value={stats.estimated_mean_reward.toFixed(3)} /> {' '} user detail view.
</div> </p>
)} </section>
{stats?.theta && (
<div className="space-y-2">
<h2 className="text-sm font-medium text-gray-400">θ (learned weight vector)</h2>
<div className="flex gap-3 flex-wrap">
{stats.theta.map((v, i) => (
<div key={i} className="bg-gray-900 border border-gray-800 rounded p-3 text-center min-w-[100px]">
<div className="text-xs text-gray-500 mb-1">{FEATURE_LABELS[i] ?? `feat_${i}`}</div>
<div className={`text-sm font-mono ${v > 0 ? 'text-green-400' : v < 0 ? 'text-red-400' : 'text-gray-400'}`}>
{v.toFixed(4)}
</div>
</div>
))}
</div>
{stats.last_updated && (
<p className="text-xs text-gray-600">Last updated: {stats.last_updated}</p>
)}
</div>
)}
</div> </div>
</AdminShell> </AdminShell>
); );
} }
function StatCard({ label, value }: { label: string; value: string | number }) { function ExternalCard({ title, description, href, label }: {
title: string;
description: string;
href: string;
label: string;
}) {
return ( return (
<div className="bg-gray-900 border border-gray-800 rounded p-4"> <div className="bg-gray-900 border border-gray-800 rounded-lg p-5 flex items-start justify-between gap-4">
<div className="text-xs text-gray-500 mb-1">{label}</div> <div className="space-y-1">
<div className="text-2xl font-semibold text-white">{value}</div> <h2 className="text-sm font-medium text-gray-200">{title}</h2>
<p className="text-xs text-gray-500">{description}</p>
</div>
<a
href={href}
target="_blank"
rel="noreferrer"
className="flex-shrink-0 text-indigo-400 hover:text-indigo-300 text-xs whitespace-nowrap"
>
{label}
</a>
</div> </div>
); );
} }
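
The page above builds links by concatenating a base URL with a path such as `/#/experiments` or `/dags`. A small helper could make that more forgiving. This is a sketch only; `resolveBase` and `externalLink` are hypothetical and not part of the commit.

```typescript
// `??` falls back only on undefined/null. If a build arg were ever passed as an
// empty string, the base would become '' and links would point at the site root.
// This variant treats empty or whitespace-only values as unset.
function resolveBase(value: string | undefined, fallback: string): string {
  const trimmed = (value ?? '').trim();
  return trimmed === '' ? fallback : trimmed;
}

// Join base and path with exactly one slash between them.
function externalLink(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, '');
  const trimmedPath = path.replace(/^\/+/, '');
  return `${trimmedBase}/${trimmedPath}`;
}

const mlflowBase = resolveBase('', '/mlflow');
const experimentsHref = externalLink(mlflowBase, '#/experiments');
const dagsHref = externalLink('/airflow/', '/dags');
```

With the defaults from this commit the results match the hand-built template strings, so the helper would be a drop-in change rather than a behavior change.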

@@ -5,7 +5,7 @@ export default function LoginPage() {
<h1 className="text-2xl font-semibold">oO Admin</h1> <h1 className="text-2xl font-semibold">oO Admin</h1>
<p className="text-gray-400 text-sm">Sign in via the main app first, then return here.</p> <p className="text-gray-400 text-sm">Sign in via the main app first, then return here.</p>
<a <a
href={`${process.env.NEXT_PUBLIC_WEB_URL ?? 'http://localhost:3079'}/sign-in`} href="/sign-in"
className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors" className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors"
> >
Sign in with Google Sign in with Google

@@ -1,30 +1,53 @@
import { AdminShell } from '@/components/AdminShell'; import { AdminShell } from '@/components/AdminShell';
export default function ModelsPage() { const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? 'http://localhost:5000';
export default function ModelsPage() {
return ( return (
<AdminShell> <AdminShell>
<div className="space-y-4 h-[calc(100vh-4rem)]"> <div className="space-y-6">
<div className="flex items-center justify-between flex-shrink-0">
<h1 className="text-xl font-semibold">Model registry</h1> <h1 className="text-xl font-semibold">Model registry</h1>
<a href={mlflowUrl} target="_blank" rel="noreferrer" className="text-xs text-gray-400 hover:text-white border border-gray-700 rounded px-2 py-1"> <p className="text-sm text-gray-500">
Open MLflow Model lifecycle (runs, versions, promotions, artifacts) is managed in MLflow.
</a> Auth is separate; log in with your MLflow credentials.
</div>
<p className="text-sm text-gray-500 flex-shrink-0">
MLflow is embedded below when running under the <code className="text-xs bg-gray-800 px-1 rounded">full</code> compose profile.
Promote or archive model versions via the MLflow UI; each action writes to the audit log automatically.
</p> </p>
<div className="flex-1 rounded border border-gray-800 overflow-hidden" style={{ height: 'calc(100vh - 12rem)' }}> <ExternalCard
<iframe
src={`${mlflowUrl}/#/models`}
className="w-full h-full bg-white"
title="MLflow Model Registry" title="MLflow Model Registry"
sandbox="allow-scripts allow-same-origin allow-forms allow-popups" description="Experiment runs · registered models · version promotion · artifact browser"
href={mlflowUrl}
label="Open MLflow"
/>
<ExternalCard
title="MLflow Experiments"
description="Training runs, metrics, parameters, and comparison across runs"
href={`${mlflowUrl}/#/experiments`}
label="Browse experiments"
/> />
</div>
</div> </div>
</AdminShell> </AdminShell>
); );
} }
function ExternalCard({ title, description, href, label }: {
title: string;
description: string;
href: string;
label: string;
}) {
return (
<div className="bg-gray-900 border border-gray-800 rounded-lg p-5 flex items-start justify-between gap-4">
<div className="space-y-1">
<h2 className="text-sm font-medium text-gray-200">{title}</h2>
<p className="text-xs text-gray-500">{description}</p>
</div>
<a
href={href}
target="_blank"
rel="noreferrer"
className="flex-shrink-0 bg-indigo-600 hover:bg-indigo-500 text-white text-xs rounded px-3 py-1.5 whitespace-nowrap"
>
{label}
</a>
</div>
);
}

@@ -3,10 +3,15 @@ import { UserDetail } from '@/components/UserDetail';
export const dynamic = 'force-dynamic'; export const dynamic = 'force-dynamic';
export default function UserDetailPage({ params }: { params: { id: string } }) { export default async function UserDetailPage({
params,
}: {
params: Promise<{ id: string }>;
}) {
const { id } = await params;
return ( return (
<AdminShell> <AdminShell>
<UserDetail userId={params.id} /> <UserDetail userId={id} />
</AdminShell> </AdminShell>
); );
} }
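
The `params` change in the page components of this commit follows the Next.js 15 convention of passing dynamic route params as a Promise. Stripped of React, the pattern reduces to the sketch below. The category list and the `readDocParams` helper are hypothetical; a real page would call `notFound()` where this returns `null`.

```typescript
// Hypothetical category list; the real one backs CATEGORY_LABELS in the admin app.
const DOC_CATEGORIES = ['ops', 'architecture'] as const;
type DocCategory = (typeof DOC_CATEGORIES)[number];

// Type guard narrowing an arbitrary URL segment to a known category.
function isDocCategory(value: string): value is DocCategory {
  return DOC_CATEGORIES.some((category) => category === value);
}

// Route params arrive as a Promise, so they are awaited once before use.
type DocParams = Promise<{ category: string; slug: string }>;

async function readDocParams(
  params: DocParams,
): Promise<{ category: DocCategory; slug: string } | null> {
  const { category, slug } = await params;
  // null stands in for the notFound() call a real page would make.
  return isDocCategory(category) ? { category, slug } : null;
}
```

Destructuring right after the `await` keeps the narrowed `category` in a local variable, which is why the diff replaces every `params.category` access with the local name.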

@@ -3,14 +3,21 @@
import Link from 'next/link'; import Link from 'next/link';
import { usePathname } from 'next/navigation'; import { usePathname } from 'next/navigation';
const NAV = [ const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
type NavItem =
| { href: string; label: string; external?: false }
| { href: string; label: string; external: true };
const NAV: NavItem[] = [
{ href: '/', label: 'Overview' }, { href: '/', label: 'Overview' },
{ href: '/users', label: 'Users' }, { href: '/users', label: 'Users' },
{ href: '/events', label: 'Events' }, { href: '/events', label: 'Events' },
{ href: '/features', label: 'Features' }, { href: '/features', label: 'Features' },
{ href: '/tips', label: 'Rec log' }, { href: '/tips', label: 'Rec log' },
{ href: '/reward-analytics', label: 'Rewards' }, { href: '/reward-analytics', label: 'Rewards' },
{ href: '/experiments', label: 'Experiments' }, { href: '/experiments', label: 'MLOps' },
{ href: '/simulations', label: 'Simulations' }, { href: '/simulations', label: 'Simulations' },
{ href: '/models', label: 'Models' }, { href: '/models', label: 'Models' },
{ href: '/data-quality', label: 'Data quality' }, { href: '/data-quality', label: 'Data quality' },
@@ -21,6 +28,11 @@ const NAV = [
{ href: '/docs', label: 'Docs' }, { href: '/docs', label: 'Docs' },
]; ];
const NAV_EXTERNAL: NavItem[] = [
{ href: mlflowUrl, label: 'MLflow ↗', external: true },
{ href: airflowUrl, label: 'Airflow ↗', external: true },
];
export function AdminShell({ children }: { children: React.ReactNode }) { export function AdminShell({ children }: { children: React.ReactNode }) {
const pathname = usePathname(); const pathname = usePathname();
return ( return (
@@ -33,7 +45,7 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
Admin Admin
</span> </span>
</div> </div>
<nav className="flex-1 px-2 py-3 space-y-0.5"> <nav className="flex-1 px-2 py-3 space-y-0.5 overflow-y-auto">
{NAV.map(({ href, label }) => { {NAV.map(({ href, label }) => {
const active = href === '/' ? pathname === '/' : pathname.startsWith(href); const active = href === '/' ? pathname === '/' : pathname.startsWith(href);
return ( return (
@@ -50,6 +62,20 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
</Link> </Link>
); );
})} })}
<div className="pt-3 pb-1 px-3">
<span className="text-xs text-gray-600 uppercase tracking-wider font-medium">MLOps</span>
</div>
{NAV_EXTERNAL.map(({ href, label }) => (
<a
key={href}
href={href}
target="_blank"
rel="noreferrer"
className="flex items-center px-3 py-2 rounded text-sm text-gray-500 hover:text-white hover:bg-gray-900 transition-colors"
>
{label}
</a>
))}
</nav> </nav>
</aside> </aside>
{/* Main content */} {/* Main content */}
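
The active-link check in `AdminShell` can be read in isolation. The following is a minimal sketch, assuming a standalone helper named `isActive` (a hypothetical name; the component inlines this expression). It shows why the root link needs an exact match while the other links use a prefix match.

```typescript
// Hypothetical helper mirroring the inline expression in AdminShell.
function isActive(href: string, pathname: string): boolean {
  // Every pathname starts with '/', so the root link must match exactly;
  // otherwise "Overview" would be highlighted on every page.
  return href === '/' ? pathname === '/' : pathname.startsWith(href);
}

console.log(isActive('/', '/users'));        // false
console.log(isActive('/users', '/users/42')); // true
```

One consequence of the prefix match is that a route such as `/models` would also be highlighted for any sibling path that merely begins with the same characters, so route names probably should not share prefixes.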


@@ -16,9 +16,13 @@ export async function middleware(req: NextRequest) {
return NextResponse.redirect(url); return NextResponse.redirect(url);
} }
// Verify admin role via API. The API is same-origin in production (Caddy routes // Verify admin role via API. INTERNAL_API_URL (e.g. http://api:3078) is preferred
// /api/* to the Express service), so we use the rewrite target in dev. // when set — it points to the API service on the internal Docker network, avoiding
const apiBase = process.env.NEXT_PUBLIC_API_URL ?? 'http://localhost:3078'; // a Caddy round-trip. Falls back to NEXT_PUBLIC_API_URL for dev, or localhost.
const apiBase =
process.env.INTERNAL_API_URL ||
process.env.NEXT_PUBLIC_API_URL ||
'http://localhost:3078';
try { try {
const profile = await fetch(`${apiBase}/api/user/me`, { const profile = await fetch(`${apiBase}/api/user/me`, {
headers: { cookie: `sid=${sid}` }, headers: { cookie: `sid=${sid}` },
@@ -41,5 +45,5 @@ export async function middleware(req: NextRequest) {
} }
export const config = { export const config = {
matcher: ['/((?!_next/static|_next/image|favicon.ico).*)'], matcher: ['/', '/((?!_next/static|_next/image|favicon.ico).*)'],
}; };
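
The fallback chain for `apiBase` uses `||` rather than `??`, which matters because the compose file sets `NEXT_PUBLIC_API_URL` to an empty string. A minimal sketch, assuming a helper named `resolveApiBase` and an `Env` type (both hypothetical, standing in for the inline expression and `process.env`):

```typescript
// Hypothetical stand-in for process.env.
type Env = Record<string, string | undefined>;

// Hypothetical helper mirroring the expression in middleware.ts.
function resolveApiBase(env: Env): string {
  // `||` treats an empty string as "not set", so NEXT_PUBLIC_API_URL: ""
  // falls through to the next candidate. `??` would have kept the "".
  return env.INTERNAL_API_URL || env.NEXT_PUBLIC_API_URL || 'http://localhost:3078';
}

// Inside the admin container (values taken from the compose file in this commit):
console.log(resolveApiBase({ INTERNAL_API_URL: 'http://api:3078', NEXT_PUBLIC_API_URL: '' }));
// Local dev with nothing set:
console.log(resolveApiBase({}));
```

With `??`, an empty `NEXT_PUBLIC_API_URL` and no `INTERNAL_API_URL` would produce a relative fetch URL, which does not work from middleware running on the server.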


@@ -1,6 +1,9 @@
import type { NextConfig } from 'next'; import type { NextConfig } from 'next';
import path from 'node:path';
const nextConfig: NextConfig = { const nextConfig: NextConfig = {
output: 'standalone',
outputFileTracingRoot: path.join(__dirname, '../../'),
async rewrites() { async rewrites() {
return [ return [
{ {

File diff suppressed because one or more lines are too long


@@ -28,15 +28,16 @@ Same stack as `apps/web`. Reuses `packages/shared-types`, the Auth.js session co
| Heavy grids | **TanStack Table v8** | Sortable / paginated / virtualized tables for events, users, tips. | | Heavy grids | **TanStack Table v8** | Sortable / paginated / virtualized tables for events, users, tips. |
| Extra charts | **Recharts** | Fallback where Tremor falls short (histograms, distributions). | | Extra charts | **Recharts** | Fallback where Tremor falls short (histograms, distributions). |
### Embed, don't rebuild ### Link out, don't embed
Specialized tooling is **reverse-proxied into the admin shell**, not reimplemented: Specialized MLOps tooling runs as **separate external services** with their own auth, linked from the admin shell — not embedded or reimplemented:
- **MLflow UI** → `/admin/models` (Caddy sub-path proxy) - **MLflow** → `https://o.alogins.net/mlflow` — experiment tracking, model registry, artifact browser; own basic-auth for now; see M3 for SSO consolidation
- **Grafana panels** → `/admin/infra` (iframed or embedded panels) - **Airflow** → `https://o.alogins.net/airflow` — batch pipeline orchestration, dataset management; own web-auth for now
- **Grafana panels** → `/admin/infra` (iframed panels) — infra metrics
- **Marimo notebooks** → launch-out link from admin - **Marimo notebooks** → launch-out link from admin
This prevents reimplementing artifact browsers or graph renderers we'd never do as well. The admin shell links to these services; clicking them opens a new tab. The `/experiments` and `/models` admin pages are hub pages with direct links to the relevant MLflow/Airflow views.
### AuthZ ### AuthZ
@@ -55,5 +56,7 @@ This prevents reimplementing artifact browsers or graph renderers we'd never do
- One more Next.js app in the monorepo. Build/dev added to Turborepo. - One more Next.js app in the monorepo. Build/dev added to Turborepo.
- Tremor + shadcn/ui are added as dependencies. shadcn components are copied into `apps/admin/src/components/ui/` — no runtime version coupling. - Tremor + shadcn/ui are added as dependencies. shadcn components are copied into `apps/admin/src/components/ui/` — no runtime version coupling.
- MLflow and Grafana must be reachable from the Caddy reverse proxy; they are not embedded in the JS bundle. - MLflow (`o.alogins.net/mlflow*` → port 5000) and Airflow (`o.alogins.net/airflow*` → port 8080) are path-based routes in the existing `o.alogins.net` Caddy block, started via `docker compose --profile mlops up`.
- Each service manages its own auth (MLflow: built-in basic-auth; Airflow: built-in web UI auth). M3 will consolidate both behind the shared OIDC provider.
- The `NEXT_PUBLIC_MLFLOW_URL` and `NEXT_PUBLIC_AIRFLOW_URL` build args in `Dockerfile.admin` default to the relative paths `/mlflow` and `/airflow`, which resolve against the production host; override them for dev builds.
- `admin_actions` audit log grows unboundedly — needs a retention policy before M4. - `admin_actions` audit log grows unboundedly — needs a retention policy before M4.


@@ -46,21 +46,42 @@ User reactions (done / snooze / dismiss) are events too. They close the loop as
- **Protobuf** for event schemas with a schema registry (ADR-0005) — train/serve parity depends on this. - **Protobuf** for event schemas with a schema registry (ADR-0005) — train/serve parity depends on this.
- **OpenAPI** for HTTP; TS client auto-generated; Python pydantic hand-written while consumers are few. - **OpenAPI** for HTTP; TS client auto-generated; Python pydantic hand-written while consumers are few.
- **Feast** for feature store when we get there; homegrown adapter until then (Phase 1 seam). - **Feast** for feature store when we get there; homegrown adapter until then (Phase 1 seam).
- **MLflow** for model registry; artifacts in MinIO/S3. - **MLflow** for model registry and experiment tracking; deployed at `o.alogins.net/mlflow`.
- **Airflow** for batch pipelines; deployed at `o.alogins.net/airflow`.
- **Auth.js** embedded behind an OIDC-shaped boundary (ADR-0004). Swap to a standalone OIDC provider when mobile ships. - **Auth.js** embedded behind an OIDC-shaped boundary (ADR-0004). Swap to a standalone OIDC provider when mobile ships.
- **k3s** as the first step beyond docker-compose — no "compose → full k8s" cliff. - **k3s** as the first step beyond docker-compose — no "compose → full k8s" cliff.
## Decision flow for a new tip ## AI stack
All LLM inference routes through **LiteLLM** (`llm.alogins.net`) backed by **Ollama** (local, `localhost:11434`). This means:
- Model aliases (`tip-generator`, `embedder`, `judge`) decouple code from model names.
- Swapping qwen2.5 → llama3.2 = one-line config change in LiteLLM, zero code change in oO.
- Cloud fallback (Anthropic) is opt-in and gated behind `ANTHROPIC_API_KEY` — used only in offline simulation.
**OpenWebUI** (`ai.alogins.net`) is the human-facing interface for prompt iteration and model testing during development.
## Decision flow for a new tip (Phase 2 target)
``` ```
client ─► gateway ─► recommender client ─► gateway ─► recommender (TS)
├─► candidates: integrations.fetchCandidates(user) + advice.library
├─► context: FeatureAssembler(user, request) ml/serving (Python)
├─► policy: PolicyRegistry.get(policyName).pick(candidates, context)
├─► shadows: run shadow policies in parallel, log their picks ├─► context: ml/features/context.py
└─► persist: TipInstance{context_snapshot, policy, tip} (tasks + reactions + time patterns → prompt)
◄─ tip
├─► generate: LiteLLM → Ollama
│ → N TipCandidates {content, kind, model, prompt_version}
├─► score: bandit policy scores each candidate
├─► shadows: shadow policies log picks without serving
└─► persist: tip_scores {candidate, policy, features, latency}
◄─ best TipCandidate
``` ```
Feedback travels back the same path: `POST /feedback → events.emit(feedback.reaction)` → pipelines consume → bandit/model updated on next retrain. **Phase 1 (current):** candidates come from Todoist task list, no LLM. The bandit scores tasks directly.
Feedback: `POST /feedback → events.emit(reaction)` → online bandit update + `prompt_version` tracked for A/B analysis.
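
The score and shadow steps of the Phase 2 flow can be sketched as follows. This is only an illustration under stated assumptions: the `TipCandidate` fields come from the diagram, while `Scorer`, `Decision`, `argmax`, and `decide` are hypothetical names, and the scoring functions stand in for whatever the bandit policy actually computes.

```typescript
// Fields taken from the diagram above.
interface TipCandidate {
  content: string;
  kind: string;
  model: string;
  prompt_version: string;
}

// Hypothetical: a policy reduced to a pure scoring function.
type Scorer = (candidate: TipCandidate) => number;

interface Decision {
  best: TipCandidate;                  // what the active policy serves
  scores: number[];                    // active-policy score per candidate
  shadowPicks: Record<string, number>; // shadow policy name -> index it would have picked
}

// Index of the largest value; ties resolve to the earliest index.
function argmax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

function decide(
  candidates: TipCandidate[],
  active: Scorer,
  shadows: Record<string, Scorer>,
): Decision {
  if (candidates.length === 0) throw new Error('no candidates to score');
  const scores = candidates.map(active);
  // Shadow policies only record what they would have picked; they never serve.
  const shadowPicks: Record<string, number> = {};
  for (const [name, scorer] of Object.entries(shadows)) {
    shadowPicks[name] = argmax(candidates.map(scorer));
  }
  return { best: candidates[argmax(scores)], scores, shadowPicks };
}
```

Keeping shadow picks as indices into the same candidate list makes it straightforward to compare policies later on identical inputs.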


@@ -0,0 +1,32 @@
FROM node:22-alpine AS base
RUN npm install -g pnpm
FROM base AS deps
WORKDIR /app
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
COPY packages/shared-types/package.json ./packages/shared-types/
COPY apps/admin/package.json ./apps/admin/
RUN pnpm install --frozen-lockfile
FROM base AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
COPY --from=deps /app/apps/admin/node_modules ./apps/admin/node_modules
COPY tsconfig.base.json ./
COPY packages/shared-types ./packages/shared-types
COPY apps/admin ./apps/admin
RUN pnpm --filter @oo/shared-types build
ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow
ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow
ENV NEXT_TELEMETRY_DISABLED=1 \
NEXT_PUBLIC_MLFLOW_URL=$NEXT_PUBLIC_MLFLOW_URL \
NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL
RUN pnpm --filter @oo/admin build
FROM node:22-alpine AS runner
ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080
WORKDIR /app
COPY --from=builder /app/apps/admin/.next/standalone ./
COPY --from=builder /app/apps/admin/.next/static ./apps/admin/.next/static
CMD ["node", "apps/admin/server.js"]


@@ -22,7 +22,7 @@ RUN pnpm --filter @oo/api build
FROM node:22-alpine AS runner FROM node:22-alpine AS runner
WORKDIR /app WORKDIR /app
RUN npm install -g pnpm RUN npm install -g pnpm
COPY package.json pnpm-workspace.yaml ./ COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
COPY packages/shared-types/package.json ./packages/shared-types/ COPY packages/shared-types/package.json ./packages/shared-types/
COPY services/api/package.json ./services/api/ COPY services/api/package.json ./services/api/
RUN pnpm install --prod --frozen-lockfile RUN pnpm install --prod --frozen-lockfile


@@ -10,15 +10,13 @@ services:
profiles: [core, full] profiles: [core, full]
env_file: ../../.env.local env_file: ../../.env.local
environment: environment:
DATABASE_PATH: /data/oo.db
PORT: "3001"
NODE_ENV: production NODE_ENV: production
volumes: volumes:
- api-data:/data - /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo
ports: ports:
- "3001:3001" - "127.0.0.1:3078:3078"
healthcheck: healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3001/health"] test: ["CMD", "wget", "--spider", "-q", "http://localhost:3078/health"]
interval: 10s interval: 10s
timeout: 5s timeout: 5s
retries: 5 retries: 5
@@ -30,9 +28,30 @@ services:
profiles: [core, full] profiles: [core, full]
env_file: ../../.env.local env_file: ../../.env.local
environment: environment:
NEXT_PUBLIC_API_URL: "" # rewrites proxy to /api, no cross-origin needed in prod NODE_ENV: production
PORT: "3079"
HOSTNAME: "0.0.0.0"
NEXT_PUBLIC_API_URL: "" # Caddy routes /api/* directly to the API in prod
ports: ports:
- "3000:3000" - "127.0.0.1:3079:3079"
depends_on:
api:
condition: service_healthy
admin:
build:
context: ../..
dockerfile: infra/docker/Dockerfile.admin
profiles: [core, full]
env_file: ../../.env.local
environment:
NODE_ENV: production
PORT: "3080"
HOSTNAME: "0.0.0.0"
NEXT_PUBLIC_API_URL: ""
INTERNAL_API_URL: "http://api:3078"
ports:
- "127.0.0.1:3080:3080"
depends_on: depends_on:
api: api:
condition: service_healthy condition: service_healthy
@@ -45,12 +64,117 @@ services:
dockerfile: infra/docker/Dockerfile.ml dockerfile: infra/docker/Dockerfile.ml
profiles: [full] profiles: [full]
ports: ports:
- "8000:8000" - "127.0.0.1:8000:8000"
healthcheck: healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"] test: ["CMD", "wget", "--spider", "-q", "http://localhost:8000/health"]
interval: 10s interval: 10s
timeout: 5s timeout: 5s
retries: 5 retries: 5
volumes: # ── mlops profile — MLflow + Airflow ──────────────────────────────────────
api-data: # Start: docker compose --profile mlops up
# MLflow UI: http://localhost:5000/mlflow or https://o.alogins.net/mlflow (admin / password; change via basic_auth.ini)
# Airflow UI: http://localhost:8080/airflow or https://o.alogins.net/airflow (admin / AIRFLOW_ADMIN_PASSWORD)
# Caddy routes /mlflow* and /airflow* inside the o.alogins.net block
airflow-db:
image: postgres:16-alpine
profiles: [mlops]
environment:
POSTGRES_DB: airflow
POSTGRES_USER: airflow
POSTGRES_PASSWORD: ${AIRFLOW_DB_PASSWORD:-airflow}
volumes:
- /mnt/ssd/dbs/oo/airflow-db:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U airflow"]
interval: 10s
timeout: 5s
retries: 5
airflow-init:
image: apache/airflow:2.9.3
profiles: [mlops]
entrypoint: /bin/bash
command:
- -c
- |
airflow db migrate
airflow users create \
--username admin \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@oo.local \
--password "$${AIRFLOW_ADMIN_PASSWORD:-admin}"
environment:
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
# Passed through so the $${AIRFLOW_ADMIN_PASSWORD:-admin} expansion in the command above sees the host value
AIRFLOW_ADMIN_PASSWORD: ${AIRFLOW_ADMIN_PASSWORD:-admin}
depends_on:
airflow-db:
condition: service_healthy
restart: "no"
airflow-webserver:
image: apache/airflow:2.9.3
profiles: [mlops]
command: webserver
environment:
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
volumes:
- ../../ml/pipelines:/opt/airflow/dags:ro
ports:
- "127.0.0.1:8080:8080"
depends_on:
airflow-init:
condition: service_completed_successfully
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 5
start_period: 60s
airflow-scheduler:
image: apache/airflow:2.9.3
profiles: [mlops]
command: scheduler
environment:
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
volumes:
- ../../ml/pipelines:/opt/airflow/dags:ro
depends_on:
airflow-init:
condition: service_completed_successfully
mlflow:
image: ghcr.io/mlflow/mlflow:2.14.3
profiles: [mlops]
command: >
mlflow server
--backend-store-uri sqlite:////mlflow/mlflow.db
--default-artifact-root /mlflow/artifacts
--host 0.0.0.0
--port 5000
--app-name basic-auth
--static-prefix /mlflow
environment:
MLFLOW_AUTH_CONFIG_PATH: /mlflow/basic_auth.ini
volumes:
- /mnt/ssd/dbs/oo/mlflow:/mlflow
- ../../infra/mlflow/basic_auth.ini:/mlflow/basic_auth.ini:ro
ports:
- "127.0.0.1:5000:5000"
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:5000/health"]
interval: 10s
timeout: 5s
retries: 5


@@ -0,0 +1,6 @@
[mlflow]
default_permission = NO_PERMISSIONS
database_uri = sqlite:////mlflow/basic_auth.db
admin_username = admin
# Change this before deploying — the admin can reset other users' passwords via the MLflow UI
admin_password = password

pnpm-lock.yaml generated

@@ -45,9 +45,6 @@ importers:
specifier: ^2.15.3 specifier: ^2.15.3
version: 2.15.4(react-dom@19.2.5(react@19.2.5))(react@19.2.5) version: 2.15.4(react-dom@19.2.5(react@19.2.5))(react@19.2.5)
devDependencies: devDependencies:
'@types/marked':
specifier: ^6.0.0
version: 6.0.0
'@types/node': '@types/node':
specifier: ^22.10.5 specifier: ^22.10.5
version: 22.19.17 version: 22.19.17
@@ -1335,10 +1332,6 @@ packages:
'@types/http-errors@2.0.5': '@types/http-errors@2.0.5':
resolution: {integrity: sha512-r8Tayk8HJnX0FztbZN7oVqGccWgw98T/0neJphO91KkmOzug1KkofZURD4UaD5uH8AqcFLfdPErnBod0u71/qg==} resolution: {integrity: sha512-r8Tayk8HJnX0FztbZN7oVqGccWgw98T/0neJphO91KkmOzug1KkofZURD4UaD5uH8AqcFLfdPErnBod0u71/qg==}
'@types/marked@6.0.0':
resolution: {integrity: sha512-jmjpa4BwUsmhxcfsgUit/7A9KbrC48Q0q8KvnY107ogcjGgTFDlIL3RpihNpx2Mu1hM4mdFQjoVc4O6JoGKHsA==}
deprecated: This is a stub types definition. marked provides its own type definitions, so you do not need this installed.
'@types/node@22.19.17': '@types/node@22.19.17':
resolution: {integrity: sha512-wGdMcf+vPYM6jikpS/qhg6WiqSV/OhG+jeeHT/KlVqxYfD40iYJf9/AE1uQxVWFvU7MipKRkRv8NSHiCGgPr8Q==} resolution: {integrity: sha512-wGdMcf+vPYM6jikpS/qhg6WiqSV/OhG+jeeHT/KlVqxYfD40iYJf9/AE1uQxVWFvU7MipKRkRv8NSHiCGgPr8Q==}
@@ -3817,10 +3810,6 @@ snapshots:
'@types/http-errors@2.0.5': {} '@types/http-errors@2.0.5': {}
'@types/marked@6.0.0':
dependencies:
marked: 14.1.4
'@types/node@22.19.17': '@types/node@22.19.17':
dependencies: dependencies:
undici-types: 6.21.0 undici-types: 6.21.0


@@ -14,7 +14,7 @@ function optional(name: string, fallback: string): string {
} }
export const config = { export const config = {
PORT: parseInt(optional('PORT', '3078'), 10), PORT: parseInt(optional('PORT', '3001'), 10),
NODE_ENV: optional('NODE_ENV', 'development'), NODE_ENV: optional('NODE_ENV', 'development'),
DATABASE_PATH: optional('DATABASE_PATH', './data/oo.db'), DATABASE_PATH: optional('DATABASE_PATH', './data/oo.db'),


@@ -22,12 +22,27 @@ export type TipServedEvent = {
export type TipFeedbackEvent = { export type TipFeedbackEvent = {
userId: string; userId: string;
tipId: string; tipId: string;
action: 'done' | 'dismiss' | 'snooze'; action: 'done' | 'dismiss' | 'snooze' | 'helpful' | 'not_helpful';
reward: number; // inferred from action + dwellMs (see inferReward in recommender.ts) reward: number; // inferred from action + dwellMs (see inferReward in recommender.ts)
dwellMs: number | null; dwellMs: number | null;
createdAt: string; createdAt: string;
}; };
export type IntegrationTokenExpiredEvent = {
userId: string;
provider: string;
detectedAt: string;
};
export type RewardDeliveryFailedEvent = {
userId: string;
tipId: string;
reward: number;
attempts: number;
error: string;
failedAt: string;
};
export type TaskSyncedEvent = { export type TaskSyncedEvent = {
userId: string; userId: string;
count: number; count: number;
@@ -37,7 +52,9 @@ export type TaskSyncedEvent = {
type EventMap = { type EventMap = {
'signals.tip.served': TipServedEvent; 'signals.tip.served': TipServedEvent;
'signals.tip.feedback': TipFeedbackEvent; 'signals.tip.feedback': TipFeedbackEvent;
'signals.tip.reward_failed': RewardDeliveryFailedEvent;
'signals.task.synced': TaskSyncedEvent; 'signals.task.synced': TaskSyncedEvent;
'signals.integration.token_expired': IntegrationTokenExpiredEvent;
}; };
export type StoredEvent = { export type StoredEvent = {
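
The `EventMap` type is what lets `bus.publish(...)` reject a payload that does not match its topic. The bus implementation is not part of this diff, so the following is a minimal sketch under assumptions: `TypedBus` and its `subscribe` method are hypothetical, and only two of the topics are reproduced.

```typescript
// Payload shapes copied from the diff above.
type IntegrationTokenExpiredEvent = { userId: string; provider: string; detectedAt: string };
type RewardDeliveryFailedEvent = {
  userId: string; tipId: string; reward: number;
  attempts: number; error: string; failedAt: string;
};

type EventMap = {
  'signals.tip.reward_failed': RewardDeliveryFailedEvent;
  'signals.integration.token_expired': IntegrationTokenExpiredEvent;
};

// Hypothetical in-memory bus; the real one may persist events or fan out differently.
class TypedBus {
  private handlers = new Map<string, Array<(payload: unknown) => void>>();

  subscribe<K extends keyof EventMap>(topic: K, handler: (payload: EventMap[K]) => void): void {
    const list = this.handlers.get(topic) ?? [];
    // The cast is safe here because publish() only passes EventMap[K] for this topic.
    list.push(handler as (payload: unknown) => void);
    this.handlers.set(topic, list);
  }

  publish<K extends keyof EventMap>(topic: K, payload: EventMap[K]): void {
    for (const handler of this.handlers.get(topic) ?? []) handler(payload);
  }
}
```

With this shape, calling `publish('signals.integration.token_expired', { userId })` without `provider` and `detectedAt` would fail at compile time rather than at runtime.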


@@ -3,7 +3,9 @@ import express from 'express';
import cookieParser from 'cookie-parser'; import cookieParser from 'cookie-parser';
import cors from 'cors'; import cors from 'cors';
import { config } from './config.js'; import { config } from './config.js';
import { runMigrations } from './db/index.js'; import { db, runMigrations } from './db/index.js';
import { tipScores, tipFeedback } from './db/schema.js';
import { lt } from 'drizzle-orm';
import { sessionMiddleware } from './middleware/session.js'; import { sessionMiddleware } from './middleware/session.js';
import { authRouter } from './routes/auth.js'; import { authRouter } from './routes/auth.js';
import { integrationsRouter } from './routes/integrations.js'; import { integrationsRouter } from './routes/integrations.js';
@@ -20,6 +22,15 @@ import type { Request, Response } from 'express';
await mkdir(dirname(config.DATABASE_PATH), { recursive: true }); await mkdir(dirname(config.DATABASE_PATH), { recursive: true });
runMigrations(); runMigrations();
// Keep the API alive on stray async faults (e.g. a single bad admin route)
// rather than dropping the whole process.
process.on('unhandledRejection', (reason) => {
console.error('[api] unhandledRejection', reason);
});
process.on('uncaughtException', (err) => {
console.error('[api] uncaughtException', err);
});
const app = express(); const app = express();
app.use( app.use(
@@ -61,6 +72,19 @@ app.use('/api/ml', requireAuth as any, requireAdmin as any, async (req: Request,
} }
}); });
async function purgeExpiredData() {
const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString();
try {
await db.delete(tipScores).where(lt(tipScores.servedAt, cutoff));
await db.delete(tipFeedback).where(lt(tipFeedback.createdAt, cutoff));
} catch (err: any) {
console.error(`[purge] retention cleanup failed: ${err.message}`);
}
}
purgeExpiredData();
setInterval(purgeExpiredData, 24 * 60 * 60 * 1000);
app.listen(config.PORT, () => { app.listen(config.PORT, () => {
console.log(`oO API listening on http://localhost:${config.PORT}`); console.log(`oO API listening on http://localhost:${config.PORT}`);
}); });
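
The retention cutoff in `purgeExpiredData` is an ISO string compared against text columns. A minimal sketch of that arithmetic, assuming the stored timestamps were all produced by `toISOString()` (the helper names `retentionCutoff` and `isExpired` are hypothetical):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: the ISO timestamp `days` before `now`.
function retentionCutoff(now: Date, days: number): string {
  return new Date(now.getTime() - days * DAY_MS).toISOString();
}

// Hypothetical helper mirroring lt(column, cutoff) on TEXT columns.
function isExpired(timestampIso: string, cutoffIso: string): boolean {
  // Plain string comparison orders correctly only because toISOString()
  // always yields the same fixed-width UTC format.
  return timestampIso < cutoffIso;
}

const cutoff = retentionCutoff(new Date('2026-04-17T00:00:00.000Z'), 30);
console.log(cutoff); // 2026-03-18T00:00:00.000Z
```

If any writer stored timestamps in another format (local time, missing milliseconds, a numeric epoch), the string comparison would no longer be reliable, so this relies on every insert path using `toISOString()`.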


@@ -368,7 +368,7 @@ router.get('/reward-analytics', async (req: AuthenticatedRequest, res: Response)
.select({ .select({
action: tipFeedback.action, action: tipFeedback.action,
count: sql<number>`count(*)`, count: sql<number>`count(*)`,
avgHour: sql<number>`avg(json_extract(ts.features_json, '$.hour_of_day'))`, avgHour: sql<number>`avg(json_extract(${tipScores.featuresJson}, '$.hour_of_day'))`,
}) })
.from(tipFeedback) .from(tipFeedback)
.leftJoin(tipScores, eq(tipFeedback.tipId, tipScores.tipId)) .leftJoin(tipScores, eq(tipFeedback.tipId, tipScores.tipId))
@@ -683,6 +683,18 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
_simProcesses.set(id, { pid: child.pid, startedAt: now }); _simProcesses.set(id, { pid: child.pid, startedAt: now });
} }
// Without this listener, a spawn failure (ENOENT when python3 is absent
// — e.g. in the alpine api container) would emit an unhandled 'error' event
// and crash the whole API process.
child.on('error', async (err) => {
console.error('[sim] spawn error', err);
_simProcesses.delete(id);
await db
.update(simRuns)
.set({ status: 'failed', finishedAt: new Date().toISOString() })
.where(eq(simRuns.id, id));
});
// Capture stderr for debugging // Capture stderr for debugging
const stderrLines: string[] = []; const stderrLines: string[] = [];
child.stderr?.on('data', (d: Buffer) => stderrLines.push(d.toString())); child.stderr?.on('data', (d: Buffer) => stderrLines.push(d.toString()));


@@ -65,7 +65,17 @@ async function fetchTodoistTasks(userId: string, accessToken: string): Promise<C
headers: { Authorization: `Bearer ${accessToken}` }, headers: { Authorization: `Bearer ${accessToken}` },
}); });
if (!res.ok) return cached?.tasks ?? []; if (!res.ok) {
if (res.status === 401) {
console.error(`[todoist] token expired for user ${userId}`);
bus.publish('signals.integration.token_expired', {
userId,
provider: 'todoist',
detectedAt: new Date().toISOString(),
});
}
return cached?.tasks ?? [];
}
const body = (await res.json()) as { const body = (await res.json()) as {
results: Array<{ results: Array<{
@@ -230,10 +240,10 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
// --------------------------------------------------------------------------- // ---------------------------------------------------------------------------
// Reward inference from action + dwell time // Reward inference from action + dwell time
// //
// Feedback is now 3 signals only: done / snooze / dismiss.
// "Helpfulness" is inferred from how long the user took to act on a tip:
// dismiss → -1.0 (clear rejection) // dismiss → -1.0 (clear rejection)
// snooze → +0.1 (tip noticed, timing off — mild positive) // snooze → +0.1 (tip noticed, timing off — mild positive)
// helpful → +0.5 (explicit positive signal)
// not_helpful → -0.5 (explicit negative signal)
// done < 15 s → -0.3 (almost certainly a stale task, not magic) // done < 15 s → -0.3 (almost certainly a stale task, not magic)
// done 15 s–2 min → +1.0 (magic zone: user saw tip and acted) // done 15 s–2 min → +1.0 (magic zone: user saw tip and acted)
// done 2–10 min → +0.6 (good: user engaged, acted in same session) // done 2–10 min → +0.6 (good: user engaged, acted in same session)
@@ -242,6 +252,8 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
function inferReward(action: string, dwellMs: number | null): number { function inferReward(action: string, dwellMs: number | null): number {
if (action === 'dismiss') return -1.0; if (action === 'dismiss') return -1.0;
if (action === 'snooze') return 0.1; if (action === 'snooze') return 0.1;
if (action === 'helpful') return 0.5;
if (action === 'not_helpful') return -0.5;
// done — use dwell time // done — use dwell time
if (dwellMs === null || dwellMs < 0) return 0.5; // unknown dwell: neutral positive if (dwellMs === null || dwellMs < 0) return 0.5; // unknown dwell: neutral positive
if (dwellMs < 15_000) return -0.3; // stale / reflex if (dwellMs < 15_000) return -0.3; // stale / reflex
@@ -250,6 +262,51 @@ function inferReward(action: string, dwellMs: number | null): number {
return 0.3; // eventually return 0.3; // eventually
} }
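
The reward table in the comment above can be written out end to end. This is a sketch rather than the file's actual function: the hunk elides the two middle `done` branches, so the 2-minute and 10-minute thresholds and their exclusive upper bounds are assumptions read off the comment, and `inferRewardSketch` is a hypothetical name.

```typescript
type Action = 'done' | 'dismiss' | 'snooze' | 'helpful' | 'not_helpful';

// Hypothetical reconstruction; the middle dwell-time branches are assumed.
function inferRewardSketch(action: Action, dwellMs: number | null): number {
  if (action === 'dismiss') return -1.0;
  if (action === 'snooze') return 0.1;
  if (action === 'helpful') return 0.5;
  if (action === 'not_helpful') return -0.5;
  // Everything below handles 'done', where dwell time decides the reward.
  if (dwellMs === null || dwellMs < 0) return 0.5; // unknown dwell
  if (dwellMs < 15_000) return -0.3;               // stale / reflex
  if (dwellMs < 120_000) return 1.0;               // assumed bound: 2 min
  if (dwellMs < 600_000) return 0.6;               // assumed bound: 10 min
  return 0.3;                                      // eventually
}

console.log(inferRewardSketch('done', 60_000)); // 1
```

Note that the explicit `helpful` and `not_helpful` signals ignore dwell time entirely; only `done` is bucketed.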
// ---------------------------------------------------------------------------
// Reward delivery with retry (bug #75 — was fire-and-forget)
// ---------------------------------------------------------------------------
async function sendRewardWithRetry(
userId: string,
tipId: string,
reward: number,
features: TaskFeatures,
): Promise<void> {
const body = JSON.stringify({
user_id: userId,
tip_id: tipId,
reward,
features,
day_of_week: new Date().getDay(),
});
for (let attempt = 1; attempt <= 3; attempt++) {
try {
const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body,
signal: AbortSignal.timeout(3000),
});
if (res.ok) return;
throw new Error(`HTTP ${res.status}`);
} catch (err: any) {
if (attempt === 3) {
console.error(`[reward] failed after 3 attempts for tip ${tipId}: ${err.message}`);
bus.publish('signals.tip.reward_failed', {
userId,
tipId,
reward,
attempts: 3,
error: err.message,
failedAt: new Date().toISOString(),
});
return;
}
await new Promise((r) => setTimeout(r, 250 * Math.pow(2, attempt)));
}
}
}
// --------------------------------------------------------------------------- // ---------------------------------------------------------------------------
// POST /api/tip/:id/feedback // POST /api/tip/:id/feedback
// --------------------------------------------------------------------------- // ---------------------------------------------------------------------------
@@ -258,7 +315,7 @@ router.post('/tip/:id/feedback', requireAuth, async (req: AuthenticatedRequest,
const tipId = String(req.params.id); const tipId = String(req.params.id);
const now = new Date(); const now = new Date();
const validActions = ['done', 'dismiss', 'snooze']; const validActions = ['done', 'dismiss', 'snooze', 'helpful', 'not_helpful'];
if (!validActions.includes(action)) { if (!validActions.includes(action)) {
res.status(400).json({ error: 'Invalid action' }); res.status(400).json({ error: 'Invalid action' });
return; return;
@@ -297,25 +354,14 @@ router.post('/tip/:id/feedback', requireAuth, async (req: AuthenticatedRequest,
bus.publish('signals.tip.feedback', { bus.publish('signals.tip.feedback', {
userId: req.userId!, userId: req.userId!,
tipId, tipId,
action: action as 'done' | 'dismiss' | 'snooze', action: action as 'done' | 'dismiss' | 'snooze' | 'helpful' | 'not_helpful',
reward, reward,
dwellMs, dwellMs,
createdAt: now.toISOString(), createdAt: now.toISOString(),
}); });
if (task) { if (task) {
// Send reward to egreedy-v1 (active policy — ADR-0007) sendRewardWithRetry(req.userId!, tipId, reward, task.features);
fetch(`${config.ML_SERVING_URL}/reward/egreedy`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
user_id: req.userId!,
tip_id: tipId,
reward,
features: task.features,
day_of_week: new Date().getDay(),
}),
}).catch(() => {});
} }
// Mark complete in Todoist if done // Mark complete in Todoist if done


@@ -41,6 +41,8 @@ export function makeTestDb() {
tip_id TEXT NOT NULL, tip_id TEXT NOT NULL,
action TEXT NOT NULL, action TEXT NOT NULL,
source_id TEXT, source_id TEXT,
dwell_ms INTEGER,
reward_milli INTEGER,
created_at TEXT NOT NULL created_at TEXT NOT NULL
); );
@@ -76,6 +78,60 @@ export function makeTestDb() {
detail TEXT, detail TEXT,
created_at TEXT NOT NULL created_at TEXT NOT NULL
); );
CREATE TABLE IF NOT EXISTS tip_scores (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id),
tip_id TEXT NOT NULL,
policy TEXT NOT NULL,
ml_score INTEGER,
features_json TEXT,
candidate_count INTEGER,
latency_ms INTEGER,
served_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS saved_queries (
id TEXT PRIMARY KEY,
admin_id TEXT NOT NULL REFERENCES users(id),
name TEXT NOT NULL,
sql TEXT NOT NULL,
created_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS sim_runs (
id TEXT PRIMARY KEY,
policy_a TEXT NOT NULL,
policy_b TEXT NOT NULL,
n_users INTEGER NOT NULL,
n_rounds INTEGER NOT NULL,
tasks_per_round INTEGER NOT NULL DEFAULT 8,
use_llm INTEGER NOT NULL DEFAULT 0,
status TEXT NOT NULL DEFAULT 'pending',
summary_json TEXT,
winner TEXT,
persona_breakdown_json TEXT,
created_at TEXT NOT NULL,
finished_at TEXT
);
CREATE TABLE IF NOT EXISTS sim_events (
id TEXT PRIMARY KEY,
run_id TEXT NOT NULL REFERENCES sim_runs(id),
round INTEGER NOT NULL,
user_id TEXT NOT NULL,
persona TEXT NOT NULL,
policy TEXT NOT NULL,
tip_content TEXT NOT NULL,
priority INTEGER NOT NULL,
is_overdue INTEGER NOT NULL,
action TEXT NOT NULL,
dwell_ms INTEGER,
reward_milli INTEGER NOT NULL,
hour INTEGER NOT NULL,
day_of_week INTEGER NOT NULL,
created_at TEXT NOT NULL
);
`); `);
return drizzle(sqlite, { schema }); return drizzle(sqlite, { schema });