chore(infra): wire MLflow/Airflow env vars, fix healthcheck, add .dockerignore

- docker-compose: pass ML_SERVING_URL, MLFLOW_URL, AIRFLOW_URL + creds to api service
- docker-compose: pass NEXT_PUBLIC_MLFLOW_URL/AIRFLOW_URL to admin service
- docker-compose: replace wget healthcheck with node fetch (wget not in node image)
- docker-compose: enable Airflow basic_auth API backend; add MLflow pip dep for DAGs
- Dockerfiles: tighten layer caching, add .dockerignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(simulate): MLflow tracking, Airflow DAG integration, health checks for mlflow/airflow
2026-04-26 12:08:43 +00:00 · 2026-04-26 12:08:36 +00:00 · 2026-04-26 12:07:43 +00:00 · 2026-04-26 03:41:39 +00:00 · 2026-04-26 03:37:28 +00:00 · 2026-04-26 03:08:28 +00:00
62 changed files with 3198 additions and 278 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -0,0 +1,19 @@
+**/node_modules
+**/.next
+**/dist
+**/coverage
+**/.vitest-cache
+**/.turbo
+.git
+.gitea
+.github
+.vscode
+.idea
+**/.env
+**/.env.local
+**/*.log
+docs
+infra/docker/data
+**/__tests__
+**/*.test.ts
+**/*.test.tsx
--- a/.env.example
+++ b/.env.example
@@ -10,6 +10,32 @@ API_BASE_URL=http://localhost:3078
 WEB_BASE_URL=http://localhost:3000
 ML_SERVING_URL=http://localhost:8000

+# MLflow (mlops profile) — http://localhost:5000/mlflow in dev, https://o.alogins.net/mlflow in prod.
+# MLFLOW_ADMIN_PASSWORD seeds the admin account on first boot (changing it after first run
+# requires the MLflow UI or API — see infra/mlflow/basic_auth.ini).
+MLFLOW_URL=http://localhost:5000
+MLFLOW_ADMIN_PASSWORD=change-me
+# Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
+NEXT_PUBLIC_MLFLOW_URL=http://localhost:5000
+
+# Airflow (mlops profile) — http://localhost:8080/airflow in dev.
+# Start with: docker compose --profile full --profile mlops up
+AIRFLOW_URL=http://localhost:8080
+AIRFLOW_ADMIN_PASSWORD=change-me
+AIRFLOW_DB_PASSWORD=airflow
+AIRFLOW_SECRET_KEY=change-me-in-prod
+AIRFLOW_FERNET_KEY=
+AIRFLOW_BASE_URL=https://o.alogins.net/airflow
+# Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
+NEXT_PUBLIC_AIRFLOW_URL=http://localhost:8080
+
+# Shared secret for Airflow→API internal callbacks. Generate: openssl rand -hex 32
+INTERNAL_API_TOKEN=
+
+# Static token for automated/service access to the admin panel (e.g. Playwright tests).
+# Leave empty to disable token-based login. Generate: openssl rand -hex 32
+ADMIN_TOKEN=
+
 # AI stack — shared Agap services (ollama + litellm + langfuse). Not run from oO.
 # Prod: https://llm.alogins.net  |  Dev: http://host.docker.internal:4000 from containers,
 # http://localhost:4000 from host. Ollama: http://host.docker.internal:11434 / :11434.
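
Reviewer note: the `INTERNAL_API_TOKEN` and `ADMIN_TOKEN` comments in `.env.example` suggest `openssl rand -hex 32`. Where `openssl` is not available, a value of the same shape can probably be produced from Node. This is a minimal sketch; `generateHexToken` is a hypothetical helper and is not part of this change.

```typescript
import { randomBytes } from 'node:crypto';

// Hypothetical helper (not in the repo): returns `bytes` random bytes encoded
// as lowercase hex. With the default of 32 bytes the result has 64 characters,
// the same shape that `openssl rand -hex 32` prints.
function generateHexToken(bytes = 32): string {
  return randomBytes(bytes).toString('hex');
}
```

The output can be pasted into `.env.local` as the value of `INTERNAL_API_TOKEN` or `ADMIN_TOKEN`.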
--- a/.gitea/workflows/buf-check.yaml
+++ b/.gitea/workflows/buf-check.yaml
@@ -0,0 +1,37 @@
+name: buf-check
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'packages/shared-types/events/**'
+  pull_request:
+    paths:
+      - 'packages/shared-types/events/**'
+
+jobs:
+  buf:
+    name: Lint & breaking-change check
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Install buf
+        run: |
+          BUF_VERSION=1.50.0
+          curl -sSfL \
+            "https://github.com/bufbuild/buf/releases/download/v${BUF_VERSION}/buf-Linux-x86_64" \
+            -o /usr/local/bin/buf
+          chmod +x /usr/local/bin/buf
+          buf --version
+
+      - name: buf lint
+        run: buf lint packages/shared-types/events
+
+      - name: buf breaking
+        if: github.event_name == 'pull_request'
+        run: |
+          buf breaking packages/shared-types/events \
+            --against ".git#branch=${{ github.base_ref }},subdir=packages/shared-types/events"
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -56,7 +56,7 @@ docs/              architecture notes, ADRs, API specs
 ## Contracts between modules

 - **HTTP** (OpenAPI, in `packages/shared-types/http/`) — synchronous request/response. In-process today; over the network once extracted. Signatures are identical.
- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Schema registry enforced in CI when #54 lands; until then payloads are JSON envelopes (ADR-0005).
+- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with an `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Proto schemas (ADR-0005) live in `packages/shared-types/events/oo/events/v1/`; `buf lint` + `buf breaking` run in CI on every PR touching those files (`.gitea/workflows/buf-check.yaml`).
 - Do not redefine types per module. Regenerate from `shared-types`.

 ## Conventions
@@ -100,7 +100,7 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i

 **M1 shipped. M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`.

-Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
+Active work: bandit promotion (#99 — offline sim + ADR-0012 pending) and M2 issues (#61 freshness SLAs, #78 signal abstraction, #93 model benchmark).

 ## What NOT to do

@@ -112,3 +112,13 @@ Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
 - Don't call LLMs directly from application code. All LLM calls go through `ml/serving` (Python) via `LITELLM_URL`. The TS recommender never holds a model name.
 - Don't embed MLflow/Airflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `/airflow`, `ai.alogins.net`.
 - Don't `nats.publish()` directly from feature code. All publishes go through the in-process `Bus` (`services/api/src/events/bus.ts`); the NATS adapter (`events/nats.ts`) bridges every publish to JetStream when `NATS_URL` is set. This keeps subscribers, the ring-buffer tail used by the admin event viewer, and JetStream all in lockstep.
+
+## Admin app
+
+`apps/admin` rewrites `/api/*` → `$NEXT_PUBLIC_API_URL/api/*` via `next.config.ts`. So `apiFetch('/admin/stats')` in `apps/admin/src/lib/api.ts` hits the Express backend, not a Next.js route.
+
+Running `tsc --noEmit -p apps/admin/tsconfig.json` always reports `Cannot find module 'next'` errors — expected outside the Next.js build context; use `next build` for real type errors.
+
+## Auth / session pattern
+
+Sessions use an `sid` cookie. Admin routes stack `requireAuth` (sets `req.userId`) then `requireAdmin` (checks `role = 'admin'` in DB). Token-based admin auth: `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var sets the `sid` cookie — used by Playwright and CI.
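
Reviewer note: the server side of `POST /api/auth/token` is not shown in this part of the diff. The comparison against `ADMIN_TOKEN` could look roughly like the sketch below. It assumes an empty or unset `ADMIN_TOKEN` disables token login (as the `.env.example` comment states) and swaps in a constant-time comparison, which the source does not confirm. `adminTokenMatches` is a hypothetical name.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto';

// Hypothetical helper, not the repo's implementation.
function adminTokenMatches(provided: string, configured: string | undefined): boolean {
  // An empty or unset ADMIN_TOKEN disables token-based login entirely.
  if (!configured || !provided) return false;
  // Hash both sides first: timingSafeEqual requires buffers of equal length,
  // and SHA-256 digests are always 32 bytes.
  const a = createHash('sha256').update(provided).digest();
  const b = createHash('sha256').update(configured).digest();
  return timingSafeEqual(a, b);
}
```

A route handler would call this with the request body's `token` and `process.env.ADMIN_TOKEN`, and set the `sid` cookie only when it returns `true`.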
--- a/apps/admin/README.md
+++ b/apps/admin/README.md
@@ -8,6 +8,15 @@ Next.js 15 app. Deployed at `admin.o.alogins.net` (dev: `http://localhost:3080`)
  and checks `role === 'admin'`. First admin is seeded via `ADMIN_SEED_EMAIL` env var at API startup.
 - Admin write actions are appended to the `admin_actions` audit log in the DB.

+## Authentication
+
+Two ways to sign in:
+
+| Method | How |
+|--------|-----|
+| Google OAuth | Click "Sign in with Google" on the login page |
+| Token | `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var; sets `sid` cookie valid for 24 h. Used by Playwright tests and CI automation. |
+
 ## Pages

 | Route | Description |
--- a/apps/admin/src/app/login/page.tsx
+++ b/apps/admin/src/app/login/page.tsx
@@ -1,15 +1,67 @@
+'use client';
+
+import { useState } from 'react';
+import { useRouter } from 'next/navigation';
+
 export default function LoginPage() {
+  const router = useRouter();
+  const [token, setToken] = useState('');
+  const [error, setError] = useState('');
+  const [loading, setLoading] = useState(false);
+
+  async function handleTokenLogin(e: React.FormEvent) {
+    e.preventDefault();
+    setError('');
+    setLoading(true);
+    try {
+      const res = await fetch('/api/auth/token', {
+        method: 'POST',
+        credentials: 'include',
+        headers: { 'Content-Type': 'application/json' },
+        body: JSON.stringify({ token }),
+      });
+      if (!res.ok) {
+        const data = await res.json().catch(() => ({}));
+        setError((data as { error?: string }).error ?? 'Invalid token');
+        return;
+      }
+      router.push('/');
+    } catch {
+      setError('Request failed');
+    } finally {
+      setLoading(false);
+    }
+  }
+
  return (
    <div className="flex min-h-screen items-center justify-center">
-      <div className="text-center space-y-4">
+      <div className="text-center space-y-6 w-72">
        <h1 className="text-2xl font-semibold">oO Admin</h1>
-        <p className="text-gray-400 text-sm">Sign in via the main app first, then return here.</p>
+
        <a
          href="/sign-in"
          className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors"
        >
          Sign in with Google
        </a>
+
+        <form onSubmit={handleTokenLogin} className="space-y-3">
+          <input
+            type="password"
+            placeholder="Admin token"
+            value={token}
+            onChange={(e) => setToken(e.target.value)}
+            className="w-full px-3 py-2 bg-gray-900 border border-gray-700 rounded text-sm focus:outline-none focus:border-gray-500"
+          />
+          {error && <p className="text-red-400 text-xs">{error}</p>}
+          <button
+            type="submit"
+            disabled={loading || !token}
+            className="w-full px-4 py-2 bg-gray-700 text-white rounded text-sm font-medium hover:bg-gray-600 disabled:opacity-40 transition-colors"
+          >
+            {loading ? 'Signing in…' : 'Sign in with token'}
+          </button>
+        </form>
      </div>
    </div>
  );
--- a/apps/admin/src/app/simulate/page.tsx
+++ b/apps/admin/src/app/simulate/page.tsx
@@ -0,0 +1,220 @@
+'use client';
+
+import { useEffect, useState } from 'react';
+import { AdminShell } from '@/components/AdminShell';
+import {
+  startSimulation,
+  getSimulationRuns,
+  getSimulationRun,
+  SimRun,
+} from '@/lib/api';
+
+const POLICIES = ['linucb-v1', 'egreedy-v1', 'egreedy-v2'];
+const mlflowBase = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
+const airflowBase = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
+
+function mlflowRunUrl(runId: string) {
+  return `${mlflowBase}/#/experiments/1/runs/${runId}`;
+}
+
+function airflowRunUrl(dagRunId: string) {
+  return `${airflowBase}/dags/bandit_sim/grid?dag_run_id=${encodeURIComponent(dagRunId)}`;
+}
+
+function StatusBadge({ status }: { status: string }) {
+  const cls: Record<string, string> = {
+    running: 'bg-blue-900 text-blue-300 border-blue-800',
+    done:    'bg-green-900 text-green-300 border-green-800',
+    failed:  'bg-red-900 text-red-300 border-red-800',
+    pending: 'bg-gray-800 text-gray-400 border-gray-700',
+  };
+  return (
+    <span className={`text-xs px-2 py-0.5 rounded border ${cls[status] ?? cls.pending}`}>
+      {status}
+    </span>
+  );
+}
+
+function SummaryRow({ run }: { run: SimRun }) {
+  const summary = run.summaryJson ? JSON.parse(run.summaryJson) as Record<string, { total_reward: number; mean_reward: number; n_pulls: number }> : null;
+  return (
+    <div className="bg-gray-900 border border-gray-800 rounded p-4 space-y-2">
+      <div className="flex items-center justify-between">
+        <div className="space-y-0.5">
+          <div className="flex items-center gap-2">
+            <span className="font-mono text-xs text-gray-500">{run.id}</span>
+            <StatusBadge status={run.status} />
+            {run.winner && <span className="text-xs text-indigo-400">winner: {run.winner}</span>}
+          </div>
+          <div className="text-xs text-gray-600">
+            {run.nUsers}u × {run.nRounds}r × {run.tasksPerRound}t/r — {run.judgeMode} judge
+            {' · '}{new Date(run.createdAt).toLocaleString()}
+          </div>
+        </div>
+        <div className="flex items-center gap-2 flex-shrink-0">
+          {run.mlflowRunId && (
+            <a href={mlflowRunUrl(run.mlflowRunId)} target="_blank" rel="noreferrer"
+               className="text-xs text-indigo-400 hover:underline">MLflow ↗</a>
+          )}
+          {run.airflowDagRunId && (
+            <a href={airflowRunUrl(run.airflowDagRunId)} target="_blank" rel="noreferrer"
+               className="text-xs text-indigo-400 hover:underline">Airflow ↗</a>
+          )}
+        </div>
+      </div>
+      {summary && (
+        <div className="grid grid-cols-2 gap-2 pt-1 lg:grid-cols-3">
+          {Object.entries(summary).map(([policy, s]) => (
+            <div key={policy} className={`rounded border p-2 text-xs ${policy === run.winner ? 'border-indigo-700 bg-indigo-950' : 'border-gray-800'}`}>
+              <div className="font-mono font-medium text-gray-300 mb-1">{policy}</div>
+              <div className="text-gray-500 space-y-0.5">
+                <div>total <span className="text-gray-300">{s.total_reward.toFixed(2)}</span></div>
+                <div>mean <span className="text-gray-300">{s.mean_reward.toFixed(4)}</span></div>
+                <div>pulls <span className="text-gray-300">{s.n_pulls}</span></div>
+              </div>
+            </div>
+          ))}
+        </div>
+      )}
+    </div>
+  );
+}
+
+export default function SimulatePage() {
+  const [runs, setRuns] = useState<SimRun[]>([]);
+  const [loading, setLoading] = useState(true);
+  const [launching, setLaunching] = useState(false);
+  const [error, setError] = useState('');
+  const [msg, setMsg] = useState('');
+
+  const [nUsers, setNUsers]             = useState(5);
+  const [nRounds, setNRounds]           = useState(20);
+  const [tasksPerRound, setTasksPerRound] = useState(8);
+  const [judgeMode, setJudgeMode]       = useState<'rule' | 'llm'>('rule');
+  const [selectedPolicies, setSelectedPolicies] = useState<string[]>(['linucb-v1', 'egreedy-v1']);
+
+  const refresh = () =>
+    getSimulationRuns()
+      .then((r) => setRuns(r.runs))
+      .catch((e) => setError(e.message))
+      .finally(() => setLoading(false));
+
+  useEffect(() => {
+    refresh();
+    const t = setInterval(refresh, 8_000);
+    return () => clearInterval(t);
+  }, []);
+
+  const togglePolicy = (p: string) =>
+    setSelectedPolicies((prev) =>
+      prev.includes(p) ? prev.filter((x) => x !== p) : [...prev, p],
+    );
+
+  const handleLaunch = async () => {
+    if (selectedPolicies.length < 2) { setError('Select at least 2 policies.'); return; }
+    setLaunching(true); setError(''); setMsg('');
+    try {
+      const r = await startSimulation({ nUsers, nRounds, tasksPerRound, judgeMode, policies: selectedPolicies });
+      setMsg(r.airflow_dag_run_id
+        ? `Launched via Airflow — dag_run_id: ${r.airflow_dag_run_id}`
+        : `Launched locally — run id: ${r.id}`);
+      await refresh();
+    } catch (e: unknown) {
+      setError((e as Error).message);
+    } finally {
+      setLaunching(false);
+    }
+  };
+
+  return (
+    <AdminShell>
+      <div className="space-y-8 max-w-4xl">
+        <h1 className="text-xl font-semibold">Simulations</h1>
+        {error && <p className="text-red-400 text-sm">{error}</p>}
+        {msg   && <p className="text-green-400 text-sm">{msg}</p>}
+
+        {/* Launch form */}
+        <section className="bg-gray-900 border border-gray-800 rounded p-5 space-y-4">
+          <h2 className="text-base font-medium text-gray-300">New simulation</h2>
+
+          <div className="grid grid-cols-3 gap-4 text-sm">
+            <label className="space-y-1">
+              <span className="text-gray-500">Users</span>
+              <input type="number" min={1} max={50} value={nUsers}
+                onChange={(e) => setNUsers(Number(e.target.value))}
+                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
+            </label>
+            <label className="space-y-1">
+              <span className="text-gray-500">Rounds</span>
+              <input type="number" min={1} max={200} value={nRounds}
+                onChange={(e) => setNRounds(Number(e.target.value))}
+                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
+            </label>
+            <label className="space-y-1">
+              <span className="text-gray-500">Tasks/round</span>
+              <input type="number" min={1} max={20} value={tasksPerRound}
+                onChange={(e) => setTasksPerRound(Number(e.target.value))}
+                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
+            </label>
+          </div>
+
+          <div className="space-y-1 text-sm">
+            <span className="text-gray-500">Policies (select ≥ 2)</span>
+            <div className="flex gap-2 flex-wrap pt-1">
+              {POLICIES.map((p) => (
+                <button key={p} onClick={() => togglePolicy(p)}
+                  className={`px-3 py-1 rounded border text-xs font-mono ${
+                    selectedPolicies.includes(p)
+                      ? 'bg-indigo-900 border-indigo-700 text-indigo-200'
+                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
+                  }`}>
+                  {p}
+                </button>
+              ))}
+            </div>
+          </div>
+
+          <div className="space-y-1 text-sm">
+            <span className="text-gray-500">Judge</span>
+            <div className="flex gap-2 pt-1">
+              {(['rule', 'llm'] as const).map((m) => (
+                <button key={m} onClick={() => setJudgeMode(m)}
+                  className={`px-3 py-1 rounded border text-xs ${
+                    judgeMode === m
+                      ? 'bg-gray-700 border-gray-500 text-white'
+                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
+                  }`}>
+                  {m}
+                </button>
+              ))}
+            </div>
+            {judgeMode === 'llm' && (
+              <p className="text-xs text-yellow-600 mt-1">LLM judge requires ANTHROPIC_API_KEY in ml/serving env.</p>
+            )}
+          </div>
+
+          <button onClick={handleLaunch} disabled={launching}
+            className="bg-indigo-600 hover:bg-indigo-500 disabled:opacity-50 text-white rounded px-4 py-2 text-sm">
+            {launching ? 'Launching…' : 'Launch simulation'}
+          </button>
+          <p className="text-xs text-gray-600">
+            Runs via <a href={airflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">Airflow</a> (mlops profile) when available; falls back to local subprocess.
+            Results logged to <a href={mlflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">MLflow</a>.
+          </p>
+        </section>
+
+        {/* Run history */}
+        <section className="space-y-3">
+          <h2 className="text-base font-medium text-gray-300">
+            Run history
+            {loading && <span className="text-xs text-gray-600 ml-2">loading…</span>}
+          </h2>
+          {runs.length === 0 && !loading && (
+            <p className="text-gray-600 text-sm">No simulations yet.</p>
+          )}
+          {runs.map((r) => <SummaryRow key={r.id} run={r} />)}
+        </section>
+      </div>
+    </AdminShell>
+  );
+}
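
Reviewer note: `SummaryRow` calls `JSON.parse(run.summaryJson)` during render, so a malformed `summaryJson` would throw and take the whole page down. A defensive parser might look like the sketch below. `parseSummary` and `PolicySummary` are hypothetical names; the field names are taken from the inline type in `SummaryRow`.

```typescript
// Per-policy summary shape, copied from the inline type used in SummaryRow.
type PolicySummary = { total_reward: number; mean_reward: number; n_pulls: number };

// Hypothetical helper: returns null instead of throwing when summaryJson is
// missing, is not valid JSON, or does not have the expected shape.
function parseSummary(summaryJson: string | null): Record<string, PolicySummary> | null {
  if (!summaryJson) return null;
  try {
    const parsed: unknown = JSON.parse(summaryJson);
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) return null;
    const out: Record<string, PolicySummary> = {};
    for (const [policy, value] of Object.entries(parsed as Record<string, unknown>)) {
      const s = value as Partial<PolicySummary> | null;
      if (
        !s ||
        typeof s.total_reward !== 'number' ||
        typeof s.mean_reward !== 'number' ||
        typeof s.n_pulls !== 'number'
      ) {
        return null; // reject the whole summary if any entry is malformed
      }
      out[policy] = { total_reward: s.total_reward, mean_reward: s.mean_reward, n_pulls: s.n_pulls };
    }
    return out;
  } catch {
    return null; // JSON.parse threw on invalid input
  }
}
```

With this in place, `SummaryRow` could render the per-policy grid only when `parseSummary(run.summaryJson)` is non-null, which matches the existing `summary &&` guard.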
--- a/apps/admin/src/components/AdminShell.tsx
+++ b/apps/admin/src/components/AdminShell.tsx
@@ -2,6 +2,7 @@

 import Link from 'next/link';
 import { usePathname } from 'next/navigation';
+import { useEffect, useState } from 'react';

 const mlflowUrl  = process.env.NEXT_PUBLIC_MLFLOW_URL  ?? '/mlflow';
 const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
@@ -10,6 +11,7 @@ type NavItem = {
  href: string;
  label: string;
  external?: boolean;
+  svcName?: string; // key in the health services map
 };

 type NavSection = {
@@ -31,10 +33,11 @@ const NAV: NavSection[] = [
    ],
  },
  {
-    label: 'Recommender status',
+    label: 'Recommender',
    items: [
      { href: '/tips',             label: 'Tips' },
      { href: '/reward-analytics', label: 'Rewards' },
+      { href: '/simulate',         label: 'Simulations' },
    ],
  },
  {
@@ -50,14 +53,33 @@ const NAV: NavSection[] = [
    label: 'Resources',
    items: [
      { href: '/docs',     label: 'Docs' },
-      { href: mlflowUrl, label: 'MLflow ↗', external: true },
-      { href: airflowUrl, label: 'Airflow ↗', external: true },
+      { href: mlflowUrl,  label: 'MLflow ↗',  external: true, svcName: 'mlflow' },
+      { href: airflowUrl, label: 'Airflow ↗', external: true, svcName: 'airflow' },
    ],
  },
 ];

+const STATUS_DOT: Record<string, string> = {
+  ok:       'bg-green-500',
+  degraded: 'bg-yellow-400',
+  down:     'bg-red-500',
+};
+
 export function AdminShell({ children }: { children: React.ReactNode }) {
  const pathname = usePathname();
+  const [svcStatus, setSvcStatus] = useState<Record<string, string>>({});
+
+  useEffect(() => {
+    fetch('/api/admin/health', { credentials: 'include' })
+      .then((r) => r.json())
+      .then((data: { services?: { name: string; status: string }[] }) => {
+        const map: Record<string, string> = {};
+        for (const s of data.services ?? []) map[s.name] = s.status;
+        setSvcStatus(map);
+      })
+      .catch(() => {});
+  }, []);
+
  return (
    <div className="flex min-h-screen">
      {/* Sidebar */}
@@ -83,13 +105,19 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
                  const active =
                    !item.external &&
                    (item.href === '/' ? pathname === '/' : pathname.startsWith(item.href));
-                  const className = `flex items-center px-3 py-2 rounded text-sm transition-colors ${
+                  const className = `flex items-center gap-2 px-3 py-2 rounded text-sm transition-colors ${
                    active
                      ? 'bg-gray-800 text-white font-medium'
                      : item.external
                        ? 'text-gray-500 hover:text-white hover:bg-gray-900'
                        : 'text-gray-400 hover:text-white hover:bg-gray-900'
                  }`;
+                  const dot = item.svcName
+                    ? svcStatus[item.svcName]
+                      ? <span className={`inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 ${STATUS_DOT[svcStatus[item.svcName]] ?? STATUS_DOT.down}`} />
+                      : <span className="inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 bg-gray-700" />
+                    : null;
+
                  return item.external ? (
                    <a
                      key={item.href}
@@ -98,6 +126,7 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
                      rel="noreferrer"
                      className={className}
                    >
+                      {dot}
                      {item.label}
                    </a>
                  ) : (
--- a/apps/admin/src/lib/api.ts
+++ b/apps/admin/src/lib/api.ts
@@ -262,3 +262,49 @@ export function saveQuery(name: string, querySql: string) {
 export function deleteSavedQuery(id: string) {
  return apiFetch<{ ok: boolean }>(`/admin/saved-queries/${id}`, { method: 'DELETE' });
 }
+
+// ── Simulations ────────────────────────────────────────────────────────────
+
+export interface SimRun {
+  id: string;
+  policyA: string;
+  policyB: string;
+  nUsers: number;
+  nRounds: number;
+  tasksPerRound: number;
+  judgeMode: string;
+  nPolicies: number;
+  status: 'pending' | 'running' | 'done' | 'failed';
+  summaryJson: string | null;
+  winner: string | null;
+  personaBreakdownJson: string | null;
+  airflowDagRunId: string | null;
+  mlflowRunId: string | null;
+  createdAt: string;
+  finishedAt: string | null;
+}
+
+export interface SimStartRequest {
+  nUsers?: number;
+  nRounds?: number;
+  tasksPerRound?: number;
+  judgeMode?: 'rule' | 'llm';
+  policies?: string[];
+}
+
+export function startSimulation(req: SimStartRequest) {
+  return apiFetch<{ id: string; status: string; airflow_dag_run_id?: string }>(
+    '/admin/simulate/start',
+    { method: 'POST', body: JSON.stringify(req) },
+  );
+}
+
+export function getSimulationRuns() {
+  return apiFetch<{ runs: SimRun[] }>('/admin/simulate/runs');
+}
+
+export function getSimulationRun(id: string) {
+  return apiFetch<{ run: SimRun & { isRunning: boolean }; events: unknown[] }>(
+    `/admin/simulate/${id}`,
+  );
+}
--- a/apps/admin/tsconfig.tsbuildinfo
+++ b/apps/admin/tsconfig.tsbuildinfo
--- a/docs/adr/0012-egreedy-v2-profile-features.md
+++ b/docs/adr/0012-egreedy-v2-profile-features.md
@@ -1,7 +1,7 @@
 # ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)

-**Status:** Accepted  
-**Date:** 2026-04-25  
+**Status:** Promoted  
+**Date:** 2026-04-25 (accepted) / 2026-04-26 (promoted)  
 **Issue:** #99

 ## Context
@@ -106,3 +106,19 @@ projecting theta without the corresponding `A` matrix cannot be done correctly.
 the D=12 target in the issue spec and complicates the sim comparison. Deferred.

 **In-place v1 promotion without shadow** — violates ADR-0002.
+
+## Promotion record (2026-04-26)
+
+Offline sim (`runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42`):
+
+| policy | total reward | mean reward | pulls |
+|--------|-------------|-------------|-------|
+| egreedy-v1 | −64.20 | −0.6420 | 100 |
+| egreedy-v2 | −62.90 | −0.6290 | 100 |
+
+**Gate passed** (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.
+
+Changes applied:
+- `recommender.ts` `remotePolicy()`: `/score/egreedy` → `/score/egreedy/v2`
+- `recommender.ts` `sendRewardWithRetry()`: `/reward/egreedy` → `/reward/egreedy/v2`, added `profile_features` to payload
+- Shadow entry `egreedy-v2-shadow` left in registry (`active: false`) for rollback.
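
Reviewer note: the gate in the promotion record ("v2 mean ≥ v1 mean") can be re-checked from the table's totals and pull counts. A small sketch; `PolicyTotals`, `meanReward` and `passesGate` are hypothetical names, and only the numbers come from the table.

```typescript
type PolicyTotals = { totalReward: number; pulls: number };

// Mean reward per pull, as reported in the "mean reward" column.
function meanReward(p: PolicyTotals): number {
  return p.totalReward / p.pulls;
}

// Gate as stated in the record: candidate mean >= baseline mean.
function passesGate(candidate: PolicyTotals, baseline: PolicyTotals): boolean {
  return meanReward(candidate) >= meanReward(baseline);
}

// Totals and pull counts from the promotion table.
const v1: PolicyTotals = { totalReward: -64.2, pulls: 100 };
const v2: PolicyTotals = { totalReward: -62.9, pulls: 100 };

console.log(meanReward(v1).toFixed(4), meanReward(v2).toFixed(4), passesGate(v2, v1));
// prints "-0.6420 -0.6290 true"
```

The margin is 0.0130 mean reward over a single seed, so the record's per-persona breakdown is worth reading alongside the aggregate.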
--- a/infra/docker/Dockerfile.admin
+++ b/infra/docker/Dockerfile.admin
@@ -1,21 +1,22 @@
-FROM node:22-alpine AS base
-RUN npm install -g pnpm
+# syntax=docker/dockerfile:1.7

-FROM base AS deps
-WORKDIR /app
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
-COPY packages/shared-types/package.json ./packages/shared-types/
-COPY apps/admin/package.json ./apps/admin/
-RUN pnpm install --frozen-lockfile
+FROM node:22-slim AS base
+RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
+ && rm -rf /var/lib/apt/lists/* \
+ && npm install -g pnpm
+ENV CI=true \
+    PNPM_HOME=/pnpm \
+    PATH=/pnpm:$PATH
+RUN pnpm config set store-dir /pnpm/store

 FROM base AS builder
 WORKDIR /app
-COPY --from=deps /app/node_modules ./node_modules
-COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
-COPY --from=deps /app/apps/admin/node_modules ./apps/admin/node_modules
-COPY tsconfig.base.json ./
-COPY packages/shared-types ./packages/shared-types
-COPY apps/admin ./apps/admin
+COPY pnpm-lock.yaml ./
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
+COPY . .
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
+    pnpm install --frozen-lockfile --offline \
+      --filter @oo/admin... --filter @oo/shared-types
 RUN pnpm --filter @oo/shared-types build
 ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow
 ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow
@@ -24,7 +25,7 @@ ENV NEXT_TELEMETRY_DISABLED=1 \
    NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL
 RUN pnpm --filter @oo/admin build

-FROM node:22-alpine AS runner
+FROM node:22-slim AS runner
 ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080
 WORKDIR /app
 COPY --from=builder /app/apps/admin/.next/standalone ./
--- a/infra/docker/Dockerfile.api
+++ b/infra/docker/Dockerfile.api
@@ -1,32 +1,35 @@
-FROM node:22-alpine AS base
-RUN npm install -g pnpm
+# syntax=docker/dockerfile:1.7

-FROM base AS deps
-WORKDIR /app
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
-COPY packages/shared-types/package.json ./packages/shared-types/
-COPY services/api/package.json ./services/api/
-RUN pnpm install --frozen-lockfile
+FROM node:22-slim AS base
+RUN apt-get update && apt-get install -y --no-install-recommends \
+      python3 make g++ ca-certificates \
+ && rm -rf /var/lib/apt/lists/* \
+ && npm install -g pnpm
+ENV CI=true \
+    PNPM_HOME=/pnpm \
+    PATH=/pnpm:$PATH
+RUN pnpm config set store-dir /pnpm/store

 FROM base AS builder
 WORKDIR /app
-COPY --from=deps /app/node_modules ./node_modules
-COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
-COPY --from=deps /app/services/api/node_modules ./services/api/node_modules
-COPY tsconfig.base.json ./
-COPY packages/shared-types ./packages/shared-types
-COPY services/api ./services/api
+COPY pnpm-lock.yaml ./
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
+COPY . .
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
+    pnpm install --frozen-lockfile --offline \
+      --filter @oo/api... --filter @oo/shared-types
 RUN pnpm --filter @oo/shared-types build
 RUN pnpm --filter @oo/api build
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
+    pnpm --filter @oo/api --prod deploy --legacy /deploy \
+ && cp -r services/api/dist /deploy/dist \
+ && rm -rf /deploy/node_modules/@oo/shared-types/src \
+ && cp -r packages/shared-types/dist /deploy/node_modules/@oo/shared-types/dist

-FROM node:22-alpine AS runner
+FROM node:22-slim AS runner
 WORKDIR /app
-RUN npm install -g pnpm
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
-COPY packages/shared-types/package.json ./packages/shared-types/
-COPY services/api/package.json ./services/api/
-RUN pnpm install --prod --frozen-lockfile
-COPY --from=builder /app/packages/shared-types/dist ./packages/shared-types/dist
-COPY --from=builder /app/services/api/dist ./services/api/dist
-WORKDIR /app/services/api
+ENV NODE_ENV=production
+COPY --from=builder /deploy/package.json ./
+COPY --from=builder /deploy/node_modules ./node_modules
+COPY --from=builder /deploy/dist ./dist
 CMD ["node", "dist/index.js"]
--- a/infra/docker/Dockerfile.ml
+++ b/infra/docker/Dockerfile.ml
@@ -2,5 +2,5 @@ FROM python:3.12-slim
 WORKDIR /app
 COPY ml/serving/requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
-COPY ml/serving/main.py .
+COPY ml/serving/*.py .
 CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
--- a/infra/docker/docker-compose.yml
+++ b/infra/docker/docker-compose.yml
@@ -11,12 +11,18 @@ services:
    env_file: ../../.env.local
    environment:
      NODE_ENV: production
+      ML_SERVING_URL: "http://ml-serving:8000"
+      MLFLOW_URL: "http://mlflow:5000"
+      AIRFLOW_URL: "http://airflow-webserver:8080"
+      AIRFLOW_API_USER: "admin"
+      AIRFLOW_API_PASSWORD: "${AIRFLOW_ADMIN_PASSWORD:-admin}"
+      INTERNAL_API_TOKEN: "${INTERNAL_API_TOKEN:-}"
    volumes:
      - /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo
    ports:
      - "127.0.0.1:3078:3078"
    healthcheck:
-      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3078/health"]
+      test: ["CMD", "node", "-e", "fetch('http://localhost:3078/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"]
      interval: 10s
      timeout: 5s
      retries: 5
@@ -49,6 +55,8 @@ services:
      PORT: "3080"
      HOSTNAME: "0.0.0.0"
      NEXT_PUBLIC_API_URL: ""
+      NEXT_PUBLIC_MLFLOW_URL: "/mlflow"
+      NEXT_PUBLIC_AIRFLOW_URL: "/airflow"
      INTERNAL_API_URL: "http://api:3078"
    ports:
      - "127.0.0.1:3080:3080"
@@ -133,8 +141,14 @@ services:
      AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
      AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
+      AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth"
+      _PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
+      MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
+      MLFLOW_TRACKING_USERNAME: "admin"
+      MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
    volumes:
      - ../../ml/pipelines:/opt/airflow/dags:ro
+      - ../../ml:/opt/airflow/ml:ro
    ports:
      - "127.0.0.1:8080:8080"
    depends_on:
@@ -155,8 +169,13 @@ services:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
+      _PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
+      MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
+      MLFLOW_TRACKING_USERNAME: "admin"
+      MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
    volumes:
      - ../../ml/pipelines:/opt/airflow/dags:ro
+      - ../../ml:/opt/airflow/ml:ro
    depends_on:
      airflow-init:
        condition: service_completed_successfully
--- a/ml/README.md
+++ b/ml/README.md
@@ -4,8 +4,8 @@ Python. Owns models, features, training, online scoring.

 | Dir | Role | Phase |
 |---|---|---|
-| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`), called by `recommender` | 1–2 |
-| `features/` | context assembler (`context.py`): signals → `PromptContext`; Feast adapter later | 2 |
+| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`) + JetStream consumers for `signals.>` / `feedback.>`, called by `recommender` | 1–2 |
+| `features/` | context assembler (`context.py`): signals → `PromptContext`; profile-feature schema mirror (`profile_schema.py`); Feast adapter later | 2 |
 | `pipelines/` | batch feature + training DAGs (Prefect/Airflow) | 4 |
 | `registry/` | MLflow-backed model registry integration | 4 |
 | `experiments/` | A/B assignment + multi-armed bandit policies | 4 |
@@ -18,14 +18,24 @@ Python. Owns models, features, training, online scoring.
 - Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew).
 - Shadow deploys before any policy change that affects real users.

-## Profile-feature contract
+## Feature contract
+
+### Profile features (batched)

 User-level features (completion rate, preferred hour, tip volume…) are computed
-by the TypeScript recommender and shipped to ml/serving on every `/score` and
+by the TypeScript recommender and shipped to `ml/serving` on every `/score` and
 `/generate` call as `profile_features: dict | None`. The Python mirror in
-`features/profile_schema.py` documents the available names + dtypes — keep it
-in sync with `services/api/src/profile/registry.ts` (a CI-style test asserts
-the name sets match). See ADR-0011.
+`features/profile_schema.py` documents each feature's name, dtype, TTL, source,
+and null fallback — keep it in sync with `services/api/src/profile/registry.ts`
+(a CI-style test asserts names and `ttlSec` values match). See ADR-0011.
+
+### Context features (JIT)
+
+Request-time signals assembled by `features/context.py` (`hour_of_day`,
+`day_of_week`, task list). These are never cached — they are derived from the
+system clock and the live Todoist feed at the moment of the score call.
+`CONTEXT_FEATURES` in `context.py` declares freshness, source, and fallback for
+each field (issue #61).

 ## Prompt registry

--- a/ml/experiments/sim/runner.py
+++ b/ml/experiments/sim/runner.py
@@ -26,6 +26,7 @@ from __future__ import annotations

 import argparse
 import json
+import os
 import random
 import sys
 import time
@@ -40,6 +41,12 @@ from llm_judge import ACTIONS, infer_reward, judge
 from personas import PERSONAS, Persona
 from task_generator import generate_task_pool

+try:
+    import mlflow
+    _MLFLOW_AVAILABLE = True
+except ImportError:
+    _MLFLOW_AVAILABLE = False
+
 POLICY_SCORE_ENDPOINTS: dict[str, str] = {
    "linucb-v1": "/score",
    "egreedy-v1": "/score/egreedy",
@@ -107,14 +114,30 @@ def _call_reward(

 # ── Standard single-pass runner (rule / llm modes) ─────────────────────────

+def _init_mlflow(mlflow_url: str | None, experiment: str) -> str | None:
+    """Point MLflow at the tracking server; return "ready" on success, or None if unavailable."""
+    if not _MLFLOW_AVAILABLE or not mlflow_url:
+        return None
+    try:
+        mlflow.set_tracking_uri(mlflow_url)
+        mlflow.set_experiment(experiment)
+        return "ready"
+    except Exception as e:
+        print(f"  [warn] MLflow init failed: {e}", file=sys.stderr)
+        return None
+
+
 def run_simulation(
    n_users: int, n_rounds: int, tasks_per_round: int,
    ml_url: str, policies: list[str], use_llm: bool, seed: int,
+    mlflow_url: str | None = None, mlflow_experiment: str = "bandit_simulation",
 ) -> dict:
    rng = random.Random(seed)
    run_id = str(uuid.uuid4())[:8]
    started_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

+    mlflow_url = mlflow_url if _init_mlflow(mlflow_url, mlflow_experiment) else None  # disable tracking if init failed
+
    user_personas = [
        (f"sim-{run_id}-u{i}", PERSONAS[i % len(PERSONAS)])
        for i in range(n_users)
@@ -130,6 +153,26 @@ def run_simulation(
    }
    events: list[dict] = []

+    mlflow_run_id: str | None = None
+    mlflow_ctx = (
+        mlflow.start_run(run_name=run_id)
+        if (_MLFLOW_AVAILABLE and mlflow_url)
+        else None
+    )
+
+    try:
+        if mlflow_ctx:
+            active = mlflow_ctx.__enter__()
+            mlflow_run_id = active.info.run_id
+            mlflow.log_params({
+                "n_users": n_users,
+                "n_rounds": n_rounds,
+                "tasks_per_round": tasks_per_round,
+                "policies": ",".join(policies),
+                "judge": "llm" if use_llm else "rule",
+                "seed": seed,
+            })
+
        with httpx.Client(trust_env=False) as client:
            for rnd in range(n_rounds):
                hour = rng.randint(6, 22)
@@ -139,8 +182,6 @@ def run_simulation(
                for user_id, persona in user_personas:
                    seed_tasks = rnd * 997 + abs(hash(user_id)) % 997
                    tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks)
-
-                # Per-persona profile features for v2 (synthetic for sim — see ADR-0012)
                    profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None

                    for policy in policies:
@@ -179,13 +220,34 @@ def run_simulation(
                    prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0
                    acc[p]["cumulative_rewards"].append(prev + round_rewards[p])

+                if mlflow_ctx:
+                    for p in policies:
+                        mlflow.log_metric(f"{p}_cumulative_reward",
+                                          acc[p]["cumulative_rewards"][-1], step=rnd)
+
                mode = "llm" if use_llm else "rule"
                print(f"  Round {rnd+1:>3}/{n_rounds} [{mode}]  " + "  ".join(
                    f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies
                ))

-    return _build_result(run_id, started_at, policies, acc, events,
+        result = _build_result(run_id, started_at, policies, acc, events,
                               n_users, n_rounds, tasks_per_round, use_llm, seed)
+        result["mlflow_run_id"] = mlflow_run_id
+
+        if mlflow_ctx:
+            for p, s in result["summary"].items():
+                mlflow.log_metrics({
+                    f"{p}_total_reward": s["total_reward"],
+                    f"{p}_mean_reward": s["mean_reward"],
+                    f"{p}_n_pulls": s["n_pulls"],
+                })
+            mlflow.set_tag("winner", result["winner"])
+
+        return result
+
+    finally:
+        if mlflow_ctx:
+            mlflow_ctx.__exit__(*sys.exc_info())  # marks the run FAILED when an exception is propagating


 # ── Claude Code judge — phase 1: score ─────────────────────────────────────
@@ -494,6 +556,9 @@ if __name__ == "__main__":
                        help="Alias for --judge rule (backwards compat)")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--out", default=None)
+    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI"),
+                        help="MLflow tracking URI (e.g. http://mlflow:5000/mlflow)")
+    parser.add_argument("--mlflow-experiment", default="bandit_simulation")
    args = parser.parse_args()

    if args.no_llm:
@@ -534,6 +599,7 @@ if __name__ == "__main__":
            n_users=args.n_users, n_rounds=args.n_rounds,
            tasks_per_round=args.tasks_per_round, ml_url=args.ml_url,
            policies=args.policies, use_llm=use_llm, seed=args.seed,
+            mlflow_url=args.mlflow_url, mlflow_experiment=args.mlflow_experiment,
        )
        Path(out_path).write_text(json.dumps(result, indent=2))
        print()
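
The runner change above enters and exits the MLflow run by hand so that tracking stays optional. A minimal sketch of an alternative shape, assuming nothing about the real runner beyond "tracking may be disabled": `contextlib.nullcontext()` stands in for the run when tracking is off, so a single `with` block covers both cases. `FakeTracker` and `simulate` are hypothetical names used only so the sketch runs without an MLflow server; they are not part of this change.

```python
from contextlib import contextmanager, nullcontext


class FakeTracker:
    """Hypothetical stand-in for the mlflow module (no server needed)."""

    def __init__(self):
        self.status = None
        self.metrics = {}

    @contextmanager
    def start_run(self, run_name):
        # Record FAILED when the body raises, FINISHED otherwise.
        try:
            yield self
        except BaseException:
            self.status = "FAILED"
            raise
        else:
            self.status = "FINISHED"

    def log_metric(self, key, value):
        self.metrics[key] = value


def simulate(rewards, tracker=None):
    """Sum rewards; log the total only when a tracker is supplied."""
    # nullcontext() yields None, so `run` is None when tracking is disabled.
    run_ctx = tracker.start_run(run_name="demo") if tracker else nullcontext()
    total = 0.0
    with run_ctx as run:
        for r in rewards:
            total += r
        if run is not None:
            run.log_metric("total_reward", total)
    return total
```

With this shape the run status follows from how the `with` body ends, so no explicit `finally` clause is needed.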
--- a/ml/features/__init__.py
+++ b/ml/features/__init__.py
@@ -1,3 +1,8 @@
-from .context import build_context, PromptContext, TaskSignal
+from .context import build_context, PromptContext, TaskSignal, ContextFeatureSpec, CONTEXT_FEATURES
+from .profile_schema import ProfileFeature, PROFILE_FEATURES, feature_names

-__all__ = ["build_context", "PromptContext", "TaskSignal"]
+__all__ = [
+    "build_context", "PromptContext", "TaskSignal",
+    "ContextFeatureSpec", "CONTEXT_FEATURES",
+    "ProfileFeature", "PROFILE_FEATURES", "feature_names",
+]
--- a/ml/features/context.py
+++ b/ml/features/context.py
@@ -2,12 +2,56 @@
 Context assembler — converts raw user signals into a PromptContext for LLM tip generation.

 Usage:
-    from ml.features.context import build_context
+    from ml.features.context import build_context, CONTEXT_FEATURES
    ctx = build_context(tasks, hour_of_day=9, day_of_week=2)
+
+Feature-spec (issue #61):
+  All context features are JIT — they are assembled at request time from live
+  sources (system clock, caller-supplied task list) rather than read from a
+  cached profile store. They carry no TTL because they are never persisted.
 """

 from __future__ import annotations
 from dataclasses import dataclass, field
+from typing import Literal
+
+
+@dataclass(frozen=True)
+class ContextFeatureSpec:
+    name: str
+    dtype: Literal["numeric", "categorical", "list"]
+    freshness: Literal["jit", "batched"]
+    source: str
+    fallback: str
+    description: str
+
+
+CONTEXT_FEATURES: tuple[ContextFeatureSpec, ...] = (
+    ContextFeatureSpec(
+        name="hour_of_day",
+        dtype="numeric",
+        freshness="jit",
+        source="request",
+        fallback="12",
+        description="Current hour (0–23), supplied by the caller at score time.",
+    ),
+    ContextFeatureSpec(
+        name="day_of_week",
+        dtype="numeric",
+        freshness="jit",
+        source="request",
+        fallback="0",
+        description="Day of week (0=Monday … 6=Sunday, Python weekday() convention), supplied by the caller at score time.",
+    ),
+    ContextFeatureSpec(
+        name="tasks",
+        dtype="list",
+        freshness="jit",
+        source="todoist-integration",
+        fallback="[]",
+        description="User's open tasks fetched live from the Todoist integration at request time.",
+    ),
+)


@dataclass
--- a/ml/features/profile_schema.py
+++ b/ml/features/profile_schema.py
@@ -8,6 +8,12 @@ code (ml/serving, eval harnesses, notebooks) knows what fields to expect on

 Update this file whenever you add or rename a feature in the TS registry.
 The accompanying test asserts the two stay in sync at the name level.
+
+Feature-spec fields (issue #61):
+  freshness — "batched": value cached in profile store, recomputed on TTL/event.
+  ttl_sec   — cache lifetime in seconds; mirrors ``ttlSec`` in registry.ts.
+  source    — where the value originates.
+  fallback  — raw value returned when the feature is unavailable (null stored).
 """
 from __future__ import annotations

@@ -16,6 +22,10 @@ from typing import Literal


 Dtype = Literal["numeric", "categorical"]
+Freshness = Literal["jit", "batched"]
+
+_HOUR = 3600
+_DAY = 86_400


@dataclass(frozen=True)
@@ -23,28 +33,57 @@ class ProfileFeature:
    name: str
    dtype: Dtype
    description: str
+    freshness: Freshness
+    ttl_sec: int
+    source: str
+    fallback: str


 PROFILE_FEATURES: tuple[ProfileFeature, ...] = (
    ProfileFeature(
-        "completion_rate_30d", "numeric",
-        'Fraction of tips served in the last 30 days that received a "done" reaction.',
+        name="completion_rate_30d",
+        dtype="numeric",
+        description='Fraction of tips served in the last 30 days that received a "done" reaction.',
+        freshness="batched",
+        ttl_sec=6 * _HOUR,
+        source="profile_store",
+        fallback="0.0",
    ),
    ProfileFeature(
-        "dismiss_rate_30d", "numeric",
-        'Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
+        name="dismiss_rate_30d",
+        dtype="numeric",
+        description='Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
+        freshness="batched",
+        ttl_sec=6 * _HOUR,
+        source="profile_store",
+        fallback="0.0",
    ),
    ProfileFeature(
-        "mean_dwell_ms_30d", "numeric",
-        "Average dwell time (ms between served and reacted) over the last 30 days.",
+        name="mean_dwell_ms_30d",
+        dtype="numeric",
+        description="Average dwell time (ms between served and reacted) over the last 30 days.",
+        freshness="batched",
+        ttl_sec=6 * _HOUR,
+        source="profile_store",
+        fallback="null — serving normalises to 0.0",
    ),
    ProfileFeature(
-        "preferred_hour", "numeric",
-        'Hour-of-day with the most "done" reactions in the last 30 days (0-23).',
+        name="preferred_hour",
+        dtype="numeric",
+        description='Hour-of-day with the most "done" reactions in the last 30 days (0–23).',
+        freshness="batched",
+        ttl_sec=_DAY,
+        source="profile_store",
+        fallback="null — serving normalises to 0.5 (neutral alignment)",
    ),
    ProfileFeature(
-        "tip_volume_30d", "numeric",
-        "Number of tips served to the user in the last 30 days.",
+        name="tip_volume_30d",
+        dtype="numeric",
+        description="Number of tips served to the user in the last 30 days.",
+        freshness="batched",
+        ttl_sec=_HOUR,
+        source="profile_store",
+        fallback="0",
    ),
 )

--- a/ml/features/test_context.py
+++ b/ml/features/test_context.py
@@ -1,7 +1,7 @@
 """Tests for ml/features/context.py"""
 import pytest
 import sys, os; sys.path.insert(0, os.path.dirname(__file__))
-from context import build_context, TaskSignal, PromptContext
+from context import build_context, TaskSignal, PromptContext, CONTEXT_FEATURES


 def test_empty_tasks():
@@ -62,3 +62,30 @@ def test_due_date_none_preserved():
    tasks = [TaskSignal(id="x", content="No due", due_date=None)]
    ctx = build_context(tasks)
    assert ctx.tasks[0]["due_date"] is None
+
+
+# ── CONTEXT_FEATURES spec tests (issue #61) ──────────────────────────────────
+
+def test_context_features_expected_names():
+    names = {f.name for f in CONTEXT_FEATURES}
+    assert names == {"hour_of_day", "day_of_week", "tasks"}
+
+
+def test_context_features_all_jit():
+    for f in CONTEXT_FEATURES:
+        assert f.freshness == "jit", f"{f.name}: expected freshness='jit', got {f.freshness!r}"
+
+
+def test_context_features_source_set():
+    for f in CONTEXT_FEATURES:
+        assert f.source, f"{f.name}: source must not be empty"
+
+
+def test_context_features_fallback_set():
+    for f in CONTEXT_FEATURES:
+        assert f.fallback, f"{f.name}: fallback must not be empty"
+
+
+def test_context_features_no_duplicates():
+    names = [f.name for f in CONTEXT_FEATURES]
+    assert len(names) == len(set(names)), f"duplicate names: {names}"
--- a/ml/features/test_profile_schema.py
+++ b/ml/features/test_profile_schema.py
@@ -1,4 +1,4 @@
-"""Smoke test for profile_schema mirror (#81 phase A).
+"""Smoke test for profile_schema mirror (#81 phase A, #61 freshness spec).

 The TS registry in services/api/src/profile/registry.ts is the source of truth.
 This test checks the names listed here match the registry by reading the TS
@@ -14,6 +14,18 @@ from ml.features.profile_schema import PROFILE_FEATURES, feature_names

 REGISTRY_PATH = Path(__file__).resolve().parents[2] / "services" / "api" / "src" / "profile" / "registry.ts"

+_HOUR = 3600
+_DAY = 86_400
+
+# Expected ttl_sec values mirrored from registry.ts — keeps the two in sync.
+_EXPECTED_TTL: dict[str, int] = {
+    "completion_rate_30d": 6 * _HOUR,
+    "dismiss_rate_30d":    6 * _HOUR,
+    "mean_dwell_ms_30d":   6 * _HOUR,
+    "preferred_hour":      _DAY,
+    "tip_volume_30d":      _HOUR,
+}
+

 def _ts_registry_names() -> set[str]:
    text = REGISTRY_PATH.read_text(encoding="utf-8")
@@ -21,6 +33,35 @@ def _ts_registry_names() -> set[str]:
    return set(re.findall(r"name:\s*'([a-zA-Z0-9_]+)'", text))


+def _ts_registry_ttls() -> dict[str, int]:
+    """Parse ttlSec values from registry.ts (crude but sufficient for drift detection).
+
+    Handles TS symbolic constants (HOUR, DAY) and expressions like ``6 * HOUR``.
+    """
+    text = REGISTRY_PATH.read_text(encoding="utf-8")
+
+    # Extract numeric constants: `const HOUR = 3600;` or `const DAY = 86_400;`
+    consts: dict[str, int] = {}
+    for m in re.finditer(r"const\s+([A-Z_]+)\s*=\s*([\d_]+)", text):
+        consts[m.group(1)] = int(m.group(2).replace("_", ""))
+
+    def _eval_expr(expr: str) -> int:
+        tokens = [t.strip() for t in expr.split("*")]
+        result = 1
+        for t in tokens:
+            result *= consts[t] if t in consts else int(t)
+        return result
+
+    result: dict[str, int] = {}
+    for block in re.split(r"\{", text):
+        name_m = re.search(r"name:\s*'([a-zA-Z0-9_]+)'", block)
+        # ttlSec may be a constant name, a number, or `N * CONST`
+        ttl_m = re.search(r"ttlSec:\s*([A-Za-z0-9_]+(?:\s*\*\s*[A-Za-z0-9_]+)?)", block)
+        if name_m and ttl_m:
+            result[name_m.group(1)] = _eval_expr(ttl_m.group(1))
+    return result
+
+
 def test_python_mirror_matches_ts_registry():
    py_names = feature_names()
    ts_names = _ts_registry_names()
@@ -39,3 +80,34 @@ def test_profile_schema_no_duplicates():
 def test_profile_schema_dtypes_known():
    for f in PROFILE_FEATURES:
        assert f.dtype in {"numeric", "categorical"}
+
+
+def test_all_profile_features_are_batched():
+    for f in PROFILE_FEATURES:
+        assert f.freshness == "batched", f"{f.name}: expected freshness='batched', got {f.freshness!r}"
+
+
+def test_profile_feature_ttl_matches_ts_registry():
+    ts_ttls = _ts_registry_ttls()
+    for f in PROFILE_FEATURES:
+        assert f.name in ts_ttls, f"{f.name} not found in TS registry ttlSec parse"
+        assert f.ttl_sec == ts_ttls[f.name], (
+            f"{f.name}: Python ttl_sec={f.ttl_sec} != TS ttlSec={ts_ttls[f.name]}"
+        )
+
+
+def test_profile_feature_ttl_matches_expected():
+    for f in PROFILE_FEATURES:
+        assert f.ttl_sec == _EXPECTED_TTL[f.name], (
+            f"{f.name}: ttl_sec={f.ttl_sec}, expected {_EXPECTED_TTL[f.name]}"
+        )
+
+
+def test_profile_feature_source_is_profile_store():
+    for f in PROFILE_FEATURES:
+        assert f.source == "profile_store", f"{f.name}: unexpected source {f.source!r}"
+
+
+def test_profile_feature_fallback_set():
+    for f in PROFILE_FEATURES:
+        assert f.fallback, f"{f.name}: fallback must not be empty"
--- a/ml/pipelines/sim_dag.py
+++ b/ml/pipelines/sim_dag.py
@@ -0,0 +1,124 @@
+"""
+Airflow DAG: bandit_sim
+
+Runs a bandit policy simulation and logs results to MLflow.
+Triggered on-demand from the oO admin panel or manually from the Airflow UI.
+
+Supported conf keys (passed via dag_run.conf; each is optional and falls back to a default in the task code):
+  sim_run_id      str   — oO SQLite run ID for callback correlation
+  n_users         int   — number of synthetic users
+  n_rounds        int   — rounds per user
+  tasks_per_round int   — candidate pool size per round
+  policies        list  — policy names to compare
+  judge_mode      str   — "rule" | "llm"
+  ml_url          str   — ml/serving URL (e.g. http://ml-serving:8000)
+  mlflow_url      str   — MLflow tracking URI (e.g. http://mlflow:5000/mlflow)
+  callback_url    str   — oO API callback endpoint
+  internal_token  str   — x-internal-token header value
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+from datetime import datetime, timedelta
+
+from airflow import DAG
+from airflow.operators.python import PythonOperator
+
+
+def _run_sim(**context: object) -> dict:
+    conf: dict = context["dag_run"].conf or {}
+
+    n_users        = int(conf.get("n_users", 5))
+    n_rounds       = int(conf.get("n_rounds", 20))
+    tasks_per_round = int(conf.get("tasks_per_round", 8))
+    policies       = list(conf.get("policies", ["linucb-v1", "egreedy-v1"]))
+    judge_mode     = str(conf.get("judge_mode", "rule"))
+    ml_url         = str(conf.get("ml_url", "http://ml-serving:8000"))
+    mlflow_url     = str(conf.get("mlflow_url", os.environ.get("MLFLOW_TRACKING_URI", "")))
+    mlflow_experiment = "bandit_simulation"
+
+    sys.path.insert(0, "/opt/airflow/ml/experiments/sim")
+    from runner import run_simulation  # type: ignore[import]
+
+    use_llm = judge_mode == "llm"
+    result = run_simulation(
+        n_users=n_users,
+        n_rounds=n_rounds,
+        tasks_per_round=tasks_per_round,
+        ml_url=ml_url,
+        policies=policies,
+        use_llm=use_llm,
+        seed=42,
+        mlflow_url=mlflow_url or None,
+        mlflow_experiment=mlflow_experiment,
+    )
+    return result
+
+
+def _callback(**context: object) -> None:
+    import httpx
+
+    conf: dict = context["dag_run"].conf or {}
+    callback_url: str = str(conf.get("callback_url", ""))
+    internal_token: str = str(conf.get("internal_token", ""))
+
+    if not callback_url or not internal_token:
+        print("No callback_url or internal_token — skipping result push.", flush=True)
+        return
+
+    result: dict = context["ti"].xcom_pull(task_ids="run_sim")
+    if not result:
+        print("No result from run_sim task — callback skipped.", flush=True)
+        return
+
+    payload = {
+        "summary":           result.get("summary", {}),
+        "winner":            result.get("winner", ""),
+        "persona_breakdown": result.get("persona_breakdown", {}),
+        "events":            result.get("events", []),
+        "mlflow_run_id":     result.get("mlflow_run_id"),
+    }
+
+    try:
+        r = httpx.post(
+            callback_url,
+            json=payload,
+            headers={"x-internal-token": internal_token},
+            timeout=30.0,
+        )
+        r.raise_for_status()
+        print(f"Callback OK: {r.status_code}", flush=True)
+    except Exception as exc:
+        print(f"Callback failed: {exc}", flush=True)
+        raise
+
+
+with DAG(
+    dag_id="bandit_sim",
+    description="On-demand bandit policy simulation with MLflow tracking",
+    schedule_interval=None,
+    start_date=datetime(2025, 1, 1),
+    catchup=False,
+    tags=["bandit", "simulation", "ml"],
+    default_args={
+        "retries": 1,
+        "retry_delay": timedelta(minutes=2),
+    },
+) as dag:
+
+    run_sim = PythonOperator(
+        task_id="run_sim",
+        python_callable=_run_sim,
+        provide_context=True,
+    )
+
+    push_results = PythonOperator(
+        task_id="push_results",
+        python_callable=_callback,
+        provide_context=True,
+    )
+
+    run_sim >> push_results
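
The DAG above reads its parameters from `dag_run.conf`. A minimal sketch of how a caller might build the trigger request, assuming the Airflow 2.x stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`) together with the basic-auth backend enabled in docker-compose. `build_trigger_request` is a hypothetical helper, not the oO API's actual client, and the base URL may need an `/airflow` path prefix depending on how the webserver is mounted.

```python
import base64
import json
import urllib.request


def build_trigger_request(airflow_url, user, password, conf):
    """Build (but do not send) a POST that triggers the bandit_sim DAG."""
    url = f"{airflow_url.rstrip('/')}/api/v1/dags/bandit_sim/dagRuns"
    # HTTP basic auth: base64("user:password") in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"conf": conf}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Example conf using keys from the DAG docstring (values are illustrative).
request = build_trigger_request(
    "http://airflow-webserver:8080",
    "admin",
    "admin",
    {"n_users": 5, "n_rounds": 20, "policies": ["linucb-v1", "egreedy-v1"]},
)
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be inspected without a running Airflow instance.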
--- a/ml/serving/README.md
+++ b/ml/serving/README.md
@@ -0,0 +1,104 @@
+# ml/serving
+
+FastAPI online scorer, tip generator, and JetStream consumer.
+
+## Contract
+
+| Endpoint | Description |
+|----------|-------------|
+| `POST /score` | LinUCB d=5 (baseline, shadow-eligible) |
+| `POST /score/egreedy` | ε-greedy v1, d=7 (active policy — ADR-0007) |
+| `POST /score/egreedy/v2` | ε-greedy v2, d=12 + profile features (shadow — ADR-0012) |
+| `POST /reward` / `/reward/egreedy` / `/reward/egreedy/v2` | Online reward update per policy |
+| `POST /generate` | LLM tip candidates via LiteLLM `tip-generator` alias |
+| `GET /stats/{user_id}` / `/stats/egreedy/{user_id}` / `/stats/egreedy/v2/{user_id}` | Per-user policy stats |
+| `GET /features/{user_id}` | Last 100 scored feature vectors (ring buffer) |
+| `POST /reset/{user_id}` | Clear all per-user bandit state (admin) |
+| `GET /health` | `{ ok, nats: { enabled, consumers: { signals, feedback } } }` |
+
+Called by `services/api/src/recommender/` over HTTP. Contract is stable across policy swaps.
+
+## Feature dimensions
+
+| Policy | d | Extra dims vs previous |
+|--------|---|------------------------|
+| LinUCB v1 | 5 | hour_sin/cos, is_overdue, task_age, priority |
+| ε-greedy v1 | 7 | + dow_sin/cos |
+| ε-greedy v2 | 12 | + 5 profile features (ADR-0012) |
+
+Profile features are computed by the TypeScript API and shipped on each `/score` call as `profile_features`. See `ml/README.md` and ADR-0011.
+
+## JetStream consumers
+
+On startup, `nats_consumer.py` registers two durable push consumers against NATS JetStream:
+
+| Consumer | Stream | Subjects | Durable name |
+|----------|--------|----------|--------------|
+| signals | `signals` | `signals.>` | `feature-pipeline-signals` |
+| feedback | `feedback` | `feedback.>` | `feature-pipeline-feedback` |
+
+**Handled subjects:**
+- `signals.task.synced` — writes `{last_sync_ts, task_count}` to `{STATE_DIR}/{user}_sync.json`
+- `signals.tip.feedback` — logged for observability; reward update happens via the HTTP path in the recommender
+
+**Payload validation:** each message is validated against the pydantic models in `schemas.py` (mirroring `packages/shared-types/events/oo/events/v1/`). A `ValidationError` triggers a nak so the message is redelivered rather than silently dropped.
+
+**Ack semantics:** explicit ack on success; nak for redelivery on error; after `NATS_MAX_DELIVER` delivery attempts the message is no longer redelivered.
+
+**Disabled** when `NATS_URL` is unset (default in local dev without NATS). No import of `nats-py` occurs in that case.
+
+## Observability
+
+Logs are structured JSON via **structlog**. Every line includes `level`, `logger`, `timestamp`, and — when a W3C `traceparent` header is present on the incoming request — `trace_id` bound via Python `contextvars`, so all log lines within a request carry the same trace ID as the upstream API call.
+
+Sentry error capture is active when `SENTRY_DSN` is set.
+
+## Config
+
+| Env var | Default | Description |
+|---------|---------|-------------|
+| `STATE_DIR` | `/tmp/oo-bandit-state` | Directory for per-user bandit state JSON files |
+| `LITELLM_URL` | `http://localhost:4000` | LiteLLM gateway |
+| `LITELLM_MASTER_KEY` | `sk-oo-dev` | LiteLLM auth key |
+| `NATS_URL` | `` | NATS broker URL; empty = consumers disabled |
+| `NATS_DURABLE_PREFIX` | `feature-pipeline` | Prefix for durable consumer names |
+| `NATS_MAX_DELIVER` | `5` | Max redelivery attempts before dropping |
+| `DEFAULT_PROMPT_VERSION` | `v1` | Fallback prompt version for `/generate` |
+| `ENV` | `development` | Environment label (passed to Sentry) |
+| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
+
+## Health story
+
+`GET /health` returns `{ ok: true }` plus NATS consumer state:
+
+```json
+{
+  "ok": true,
+  "nats": {
+    "enabled": true,
+    "consumers": {
+      "signals": { "last_msg_ts": "2026-04-25T10:00:00Z", "processed": 42, "errors": 0 },
+      "feedback": { "last_msg_ts": null, "processed": 0, "errors": 0 }
+    }
+  }
+}
+```
+
+`last_msg_ts` is `null` until the first message arrives. Used by docker-compose healthcheck.
+
+## Extraction criteria
+
+Already runs as its own process. Move it to a dedicated host / GPU node when:
+- p99 scoring latency exceeds 50 ms under load, **or**
+- model weights are too large to share memory with the Python process on the current host.
+
+## State
+
+Per-user bandit state is stored as JSON files in `STATE_DIR`:
+
+| File pattern | Policy |
+|---|---|
+| `{user}.json` | LinUCB v1 |
+| `{user}_egreedy.json` | ε-greedy v1 |
+| `{user}_egreedy_v2.json` | ε-greedy v2 |
+| `{user}_sync.json` | Last task sync metadata (written by JetStream consumer) |
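
The `/health` payload documented in the README above can be reduced to a few monitoring signals. A minimal sketch, assuming only the JSON shape shown in the "Health story" section; `summarize_health` and the `idle` / `erroring` labels are hypothetical names introduced for illustration, not part of the service.

```python
import json

# Sample payload copied from the README's "Health story" section.
SAMPLE = """
{
  "ok": true,
  "nats": {
    "enabled": true,
    "consumers": {
      "signals": { "last_msg_ts": "2026-04-25T10:00:00Z", "processed": 42, "errors": 0 },
      "feedback": { "last_msg_ts": null, "processed": 0, "errors": 0 }
    }
  }
}
"""


def summarize_health(payload):
    """Reduce a /health response to a few flags a monitor could alert on."""
    nats = payload.get("nats", {})
    consumers = nats.get("consumers", {})
    return {
        "ok": bool(payload.get("ok")),
        "nats_enabled": bool(nats.get("enabled")),
        # Consumers that have not received any message yet.
        "idle": sorted(n for n, c in consumers.items() if c.get("last_msg_ts") is None),
        # Consumers that have recorded at least one processing error.
        "erroring": sorted(n for n, c in consumers.items() if c.get("errors", 0) > 0),
    }


summary = summarize_health(json.loads(SAMPLE))
```

An idle consumer is not necessarily unhealthy, since `last_msg_ts` stays `null` until the first message arrives.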
--- a/ml/serving/logging_config.py
+++ b/ml/serving/logging_config.py
@@ -0,0 +1,20 @@
+"""Structlog JSON configuration — import once at process start."""
+import logging
+import structlog
+
+
+def configure() -> None:
+    structlog.configure(
+        processors=[
+            structlog.contextvars.merge_contextvars,
+            structlog.stdlib.add_log_level,
+            structlog.stdlib.add_logger_name,
+            structlog.processors.TimeStamper(fmt="iso"),
+            structlog.processors.StackInfoRenderer(),
+            structlog.processors.JSONRenderer(),
+        ],
+        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
+        context_class=dict,
+        logger_factory=structlog.PrintLoggerFactory(),
+    )
+    logging.basicConfig(level=logging.WARNING)
--- a/ml/serving/main.py
+++ b/ml/serving/main.py
@@ -28,17 +28,55 @@ import math
 import os
 import time
 from collections import deque
+from contextlib import asynccontextmanager
 from pathlib import Path
 from typing import Optional, Deque

 import httpx
 import numpy as np
-from fastapi import FastAPI, HTTPException
+import sentry_sdk
+import structlog
+import structlog.contextvars
+from fastapi import FastAPI, HTTPException, Request
 from pydantic import BaseModel
+from starlette.middleware.base import BaseHTTPMiddleware

+import logging_config
+import nats_consumer
 from prompts import get_prompt

-app = FastAPI(title="oO ML Serving", version="1.0.0")
+logging_config.configure()
+
+_SENTRY_DSN = os.getenv("SENTRY_DSN")
+if _SENTRY_DSN:
+    sentry_sdk.init(dsn=_SENTRY_DSN, environment=os.getenv("ENV", "development"))
+
+log = structlog.get_logger()
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    await nats_consumer.start(STATE_DIR)
+    yield
+    await nats_consumer.stop()
+
+
+app = FastAPI(title="oO ML Serving", version="1.0.0", lifespan=lifespan)
+
+
+class _TracingMiddleware(BaseHTTPMiddleware):
+    async def dispatch(self, request: Request, call_next):
+        structlog.contextvars.clear_contextvars()
+        traceparent = request.headers.get("traceparent", "")
+        if traceparent:
+            parts = traceparent.split("-")
+            trace_id = parts[1] if len(parts) == 4 and len(parts[1]) == 32 else None
+            if trace_id:
+                structlog.contextvars.bind_contextvars(trace_id=trace_id)
+        return await call_next(request)
+
+
+app.add_middleware(_TracingMiddleware)

 LITELLM_URL = os.getenv("LITELLM_URL", "http://localhost:4000")
 LITELLM_MASTER_KEY = os.getenv("LITELLM_MASTER_KEY", "sk-oo-dev")
@@ -315,7 +353,13 @@ class GenerateResponse(BaseModel):

@app.get("/health")
 def health():
-    return {"ok": True}
+    return {
+        "ok": True,
+        "nats": {
+            "enabled": bool(nats_consumer.NATS_URL),
+            "consumers": nats_consumer.consumer_health,
+        },
+    }


 _RETRY_SUFFIX = (
--- a/ml/serving/nats_consumer.py
+++ b/ml/serving/nats_consumer.py
@@ -0,0 +1,146 @@
+"""
+JetStream durable consumers for ml/serving.
+
+Streams:
+  signals  (subjects: signals.>) — durable: {prefix}-signals
+  feedback (subjects: feedback.>) — durable: {prefix}-feedback
+
+Handled subjects:
+  signals.task.synced   → write per-user sync metadata to STATE_DIR
+  signals.tip.feedback  → log for observability (reward is applied via HTTP path)
+
+Config (env vars):
+  NATS_URL            — broker URL; empty = consumers disabled (default: "")
+  NATS_DURABLE_PREFIX — prefix for durable consumer names (default: "feature-pipeline")
+  NATS_MAX_DELIVER    — max redelivery attempts before dropping (default: 5)
+"""
+from __future__ import annotations
+
+import json
+import os
+import time
+from pathlib import Path
+from typing import Optional
+
+import structlog
+from schemas import TaskSyncedPayload, TipFeedbackPayload
+
+log = structlog.get_logger(__name__)
+
+NATS_URL = os.getenv("NATS_URL", "")
+NATS_DURABLE_PREFIX = os.getenv("NATS_DURABLE_PREFIX", "feature-pipeline")
+NATS_MAX_DELIVER = int(os.getenv("NATS_MAX_DELIVER", "5"))
+
+# Exposed to /health
+consumer_health: dict[str, dict] = {
+    "signals": {"last_msg_ts": None, "processed": 0, "errors": 0},
+    "feedback": {"last_msg_ts": None, "processed": 0, "errors": 0},
+}
+
+_nc = None          # nats.aio.Client
+_subs: list = []    # active JetStream subscriptions
+
+
+# ── Subject handlers ───────────────────────────────────────────────────────
+
+def _sync_meta_path(state_dir: Path, user_id: str) -> Path:
+    safe = "".join(c if c.isalnum() else "_" for c in user_id)
+    return state_dir / f"{safe}_sync.json"
+
+
+async def _handle(subject: str, payload: dict, state_dir: Path) -> None:
+    if subject == "signals.task.synced":
+        msg = TaskSyncedPayload.model_validate(payload)
+        p = _sync_meta_path(state_dir, msg.userId)
+        p.write_text(json.dumps({
+            "last_sync_ts": msg.syncedAt,
+            "task_count": msg.count,
+        }))
+        log.info("nats: task_synced", user_id=msg.userId, count=msg.count)
+    elif subject == "signals.tip.feedback":
+        msg = TipFeedbackPayload.model_validate(payload)
+        log.info("nats: tip_feedback", user_id=msg.userId, tip_id=msg.tipId, action=msg.action, reward=msg.reward)
+    else:
+        log.debug("nats: unhandled subject", subject=subject)
+
+
+# ── Consumer factory ───────────────────────────────────────────────────────
+
+def _make_handler(key: str, state_dir: Path):
+    """Return an async push-consumer callback that acks on success, naks on error."""
+    async def handler(msg) -> None:
+        consumer_health[key]["last_msg_ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+        try:
+            payload = json.loads(msg.data)
+            await _handle(msg.subject, payload, state_dir)
+            await msg.ack()
+            consumer_health[key]["processed"] += 1
+        except Exception as exc:
+            consumer_health[key]["errors"] += 1
+            log.warning("nats: processing error", key=key, subject=msg.subject, exc=str(exc))
+            await msg.nak()
+    return handler
+
+
+# ── Lifecycle ──────────────────────────────────────────────────────────────
+
+async def start(state_dir: Path) -> None:
+    """Connect to NATS and register durable push consumers. No-op if NATS_URL is unset."""
+    global _nc
+    if not NATS_URL:
+        log.info("nats: NATS_URL unset — JetStream consumers disabled")
+        return
+
+    try:
+        import nats as nats_lib
+        from nats.js.api import ConsumerConfig, AckPolicy
+
+        _nc = await nats_lib.connect(
+            NATS_URL,
+            name="ml-serving",
+            reconnect_time_wait=5,
+            max_reconnect_attempts=-1,
+        )
+        js = _nc.jetstream()
+        log.info("nats: connected", url=NATS_URL)
+    except Exception as exc:
+        log.warning("nats: connection failed — consumers disabled", exc=str(exc))
+        _nc = None
+        return
+
+    config = ConsumerConfig(
+        ack_policy=AckPolicy.EXPLICIT,
+        max_deliver=NATS_MAX_DELIVER,
+    )
+
+    for key, subject in [("signals", "signals.>"), ("feedback", "feedback.>")]:
+        durable = f"{NATS_DURABLE_PREFIX}-{key}"
+        try:
+            sub = await js.subscribe(
+                subject,
+                durable=durable,
+                cb=_make_handler(key, state_dir),
+                config=config,
+            )
+            _subs.append(sub)
+            log.info("nats: subscribed", subject=subject, durable=durable)
+        except Exception as exc:
+            log.warning("nats: subscribe failed", key=key, exc=str(exc))
+
+
+async def stop() -> None:
+    """Unsubscribe all consumers, then drain and close the NATS connection."""
+    global _nc
+    for sub in _subs:
+        try:
+            await sub.unsubscribe()
+        except Exception:
+            pass
+    _subs.clear()
+    if _nc:
+        try:
+            await _nc.drain()
+        except Exception:
+            pass
+        _nc = None
+        log.info("nats: disconnected")
--- a/ml/serving/requirements.txt
+++ b/ml/serving/requirements.txt
@@ -4,3 +4,6 @@ pydantic==2.10.4
 numpy>=1.26.0
 httpx>=0.27.0
 anthropic>=0.40.0
+nats-py>=2.9.0
+structlog>=24.1.0
+sentry-sdk>=2.0.0
--- a/ml/serving/schemas.py
+++ b/ml/serving/schemas.py
@@ -0,0 +1,50 @@
+"""
+Pydantic models mirroring oo.events.v1 proto schemas.
+
+Field names use camelCase to match the proto3 JSON mapping convention
+and the TypeScript payload shapes published by services/api.
+
+Keep in sync with packages/shared-types/events/oo/events/v1/.
+"""
+from __future__ import annotations
+
+from typing import Literal, Optional
+from pydantic import BaseModel
+
+
+class TaskSyncedPayload(BaseModel):
+    userId: str
+    source: str
+    count: int
+    syncedAt: str
+
+
+class TipServedPayload(BaseModel):
+    userId: str
+    tipId: str
+    policy: str
+    servedAt: str
+
+
+class TipFeedbackPayload(BaseModel):
+    userId: str
+    tipId: str
+    action: Literal['done', 'dismiss', 'snooze', 'helpful', 'not_helpful']
+    reward: float
+    dwellMs: Optional[int] = None
+    createdAt: str
+
+
+class TipRewardFailedPayload(BaseModel):
+    userId: str
+    tipId: str
+    reward: float
+    attempts: int
+    error: str
+    failedAt: str
+
+
+class IntegrationTokenExpiredPayload(BaseModel):
+    userId: str
+    provider: str
+    detectedAt: str
--- a/ml/serving/tests/test_schemas_and_consumer.py
+++ b/ml/serving/tests/test_schemas_and_consumer.py
@@ -0,0 +1,169 @@
+"""
+Tests for schemas.py and nats_consumer._handle.
+"""
+import json
+import pytest
+import tempfile
+from pathlib import Path
+from pydantic import ValidationError
+from unittest.mock import AsyncMock
+
+from schemas import (
+    TaskSyncedPayload,
+    TipServedPayload,
+    TipFeedbackPayload,
+    TipRewardFailedPayload,
+    IntegrationTokenExpiredPayload,
+)
+from nats_consumer import _handle, _sync_meta_path
+
+
+# ── Schema validation ─────────────────────────────────────────────────────────
+
+class TestTaskSyncedPayload:
+    def test_valid(self):
+        p = TaskSyncedPayload.model_validate(
+            {"userId": "u1", "source": "todoist", "count": 5, "syncedAt": "2026-04-25T10:00:00Z"}
+        )
+        assert p.userId == "u1"
+        assert p.count == 5
+
+    def test_missing_field_raises(self):
+        with pytest.raises(ValidationError):
+            TaskSyncedPayload.model_validate({"userId": "u1", "source": "todoist"})
+
+    def test_wrong_type_raises(self):
+        with pytest.raises(ValidationError):
+            TaskSyncedPayload.model_validate(
+                {"userId": "u1", "source": "todoist", "count": "not-an-int", "syncedAt": "2026-04-25T10:00:00Z"}
+            )
+
+
+class TestTipFeedbackPayload:
+    def test_valid_without_dwell(self):
+        p = TipFeedbackPayload.model_validate(
+            {"userId": "u1", "tipId": "t1", "action": "done", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
+        )
+        assert p.dwellMs is None
+
+    def test_valid_with_dwell(self):
+        p = TipFeedbackPayload.model_validate(
+            {"userId": "u1", "tipId": "t1", "action": "helpful", "reward": 0.5,
+             "dwellMs": 3200, "createdAt": "2026-04-25T10:00:00Z"}
+        )
+        assert p.dwellMs == 3200
+
+    def test_invalid_action_raises(self):
+        with pytest.raises(ValidationError):
+            TipFeedbackPayload.model_validate(
+                {"userId": "u1", "tipId": "t1", "action": "like", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
+            )
+
+    def test_all_valid_actions(self):
+        for action in ("done", "dismiss", "snooze", "helpful", "not_helpful"):
+            p = TipFeedbackPayload.model_validate(
+                {"userId": "u1", "tipId": "t1", "action": action, "reward": 0.0, "createdAt": "2026-04-25T10:00:00Z"}
+            )
+            assert p.action == action
+
+
+class TestOtherPayloads:
+    def test_tip_served(self):
+        p = TipServedPayload.model_validate(
+            {"userId": "u1", "tipId": "t1", "policy": "egreedy-v2", "servedAt": "2026-04-25T10:00:00Z"}
+        )
+        assert p.policy == "egreedy-v2"
+
+    def test_tip_reward_failed(self):
+        p = TipRewardFailedPayload.model_validate(
+            {"userId": "u1", "tipId": "t1", "reward": 1.0, "attempts": 3,
+             "error": "timeout", "failedAt": "2026-04-25T10:00:00Z"}
+        )
+        assert p.attempts == 3
+
+    def test_integration_token_expired(self):
+        p = IntegrationTokenExpiredPayload.model_validate(
+            {"userId": "u1", "provider": "todoist", "detectedAt": "2026-04-25T10:00:00Z"}
+        )
+        assert p.provider == "todoist"
+
+
+# ── _handle behaviour ─────────────────────────────────────────────────────────
+
+TASK_SYNCED = {
+    "userId": "user-abc",
+    "source": "todoist",
+    "count": 7,
+    "syncedAt": "2026-04-25T10:00:00Z",
+}
+
+TIP_FEEDBACK = {
+    "userId": "user-abc",
+    "tipId": "tip-xyz",
+    "action": "done",
+    "reward": 1.0,
+    "dwellMs": 4200,
+    "createdAt": "2026-04-25T10:00:00Z",
+}
+
+
+class TestHandle:
+    @pytest.mark.asyncio
+    async def test_task_synced_writes_meta_file(self):
+        with tempfile.TemporaryDirectory() as tmp:
+            state_dir = Path(tmp)
+            await _handle("signals.task.synced", TASK_SYNCED, state_dir)
+            meta_path = _sync_meta_path(state_dir, "user-abc")
+            assert meta_path.exists()
+            data = json.loads(meta_path.read_text())
+            assert data["task_count"] == 7
+            assert data["last_sync_ts"] == "2026-04-25T10:00:00Z"
+
+    @pytest.mark.asyncio
+    async def test_task_synced_bad_payload_raises(self):
+        with tempfile.TemporaryDirectory() as tmp:
+            with pytest.raises(ValidationError):
+                await _handle("signals.task.synced", {"userId": "u1"}, Path(tmp))
+
+    @pytest.mark.asyncio
+    async def test_tip_feedback_valid_does_not_raise(self):
+        with tempfile.TemporaryDirectory() as tmp:
+            # should log and return cleanly
+            await _handle("signals.tip.feedback", TIP_FEEDBACK, Path(tmp))
+
+    @pytest.mark.asyncio
+    async def test_tip_feedback_bad_action_raises(self):
+        bad = {**TIP_FEEDBACK, "action": "unknown"}
+        with tempfile.TemporaryDirectory() as tmp:
+            with pytest.raises(ValidationError):
+                await _handle("signals.tip.feedback", bad, Path(tmp))
+
+    @pytest.mark.asyncio
+    async def test_unhandled_subject_is_ignored(self):
+        with tempfile.TemporaryDirectory() as tmp:
+            # should not raise for unknown subjects
+            await _handle("signals.something.new", {"any": "data"}, Path(tmp))
+
+    @pytest.mark.asyncio
+    async def test_make_handler_acks_on_success(self):
+        from nats_consumer import _make_handler
+        with tempfile.TemporaryDirectory() as tmp:
+            handler = _make_handler("signals", Path(tmp))
+            msg = AsyncMock()
+            msg.subject = "signals.task.synced"
+            msg.data = json.dumps(TASK_SYNCED).encode()
+            await handler(msg)
+            msg.ack.assert_awaited_once()
+            msg.nak.assert_not_awaited()
+
+    @pytest.mark.asyncio
+    async def test_make_handler_naks_on_validation_error(self):
+        from nats_consumer import _make_handler
+        with tempfile.TemporaryDirectory() as tmp:
+            handler = _make_handler("signals", Path(tmp))
+            msg = AsyncMock()
+            msg.subject = "signals.task.synced"
+            msg.data = json.dumps({"userId": "u1"}).encode()  # missing fields
+            await handler(msg)
+            msg.nak.assert_awaited_once()
+            msg.ack.assert_not_awaited()
--- a/packages/shared-types/README.md
+++ b/packages/shared-types/README.md
@@ -0,0 +1,63 @@
+# @oo/shared-types
+
+Canonical contracts for all inter-module communication. Two surfaces:
+
+| Surface | Format | Location |
+|---------|--------|----------|
+| HTTP (sync) | OpenAPI / TypeScript interfaces | `src/http/` |
+| Events (async) | Protocol Buffers + TS interfaces | `src/events/`, `events/` |
+
+## HTTP types
+
+Hand-written TypeScript interfaces that mirror the OpenAPI specs. Imported by
+`services/api` and `apps/web`; `ml/serving` keeps hand-written Python mirrors.
+
+| File | Types |
+|------|-------|
+| `src/http/tip.ts` | `TipCandidate`, `RecommendResponse`, `TipFeedback` |
+| `src/http/auth.ts` | `SessionUser` |
+| `src/http/integrations.ts` | `IntegrationsResponse`, `Integration` |
+| `src/http/user.ts` | `UserProfile` |
+| `src/http/signal.ts` | `Signal`, `SignalSource` |
+
+## Event types
+
+Protobuf schemas live in `events/oo/events/v1/`. TypeScript interfaces in
+`src/events/index.ts` mirror the proto envelope and payload types.
+
+| Proto file | Messages |
+|------------|----------|
+| `envelope.proto` | `Envelope` (wraps every event) |
+| `signals.proto` | `TaskSyncedPayload`, `TipServedPayload`, `TipFeedbackPayload`, `TipRewardFailedPayload` |
+| `integration.proto` | `IntegrationTokenExpiredPayload` |
+
+**Schema evolution rules (ADR-0005):**
+- Additive changes only within a version (new fields, new message types).
+- Removed fields must be marked `reserved` — never reuse a field number.
+- Breaking changes require a new package version (`oo.events.v2`) and a `schemaVersion` bump in the envelope.
+
+## Schema registry / CI gate
+
+`buf` enforces lint and breaking-change detection on every PR that touches `events/`:
+
+```bash
+# Lint
+buf lint events/
+
+# Breaking-change check against main (run from packages/shared-types; .git is at the repo root)
+buf breaking events/ --against '../../.git#branch=main,subdir=packages/shared-types/events'
+```
+
+Local shortcut: `./scripts/buf-check.sh`
+
+CI: `.gitea/workflows/buf-check.yaml` (requires a Gitea Actions runner).
+
+Install buf: `curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf`
+
+## Contract
+
+`/health` — not applicable (library package, no process).
+
+**Extraction criteria** — always a shared library. Extract to a separate registry
+service only when schema governance requires independent versioning and deployment
+(e.g. external consumers, SLA divergence from the monorepo).
--- a/packages/shared-types/events/buf.yaml
+++ b/packages/shared-types/events/buf.yaml
@@ -0,0 +1,7 @@
+version: v1
+lint:
+  use:
+    - STANDARD
+breaking:
+  use:
+    - FILE
--- a/packages/shared-types/events/oo/events/v1/envelope.proto
+++ b/packages/shared-types/events/oo/events/v1/envelope.proto
@@ -0,0 +1,25 @@
+syntax = "proto3";
+package oo.events.v1;
+
+import "oo/events/v1/signals.proto";
+import "oo/events/v1/integration.proto";
+
+// Envelope wraps every event on the bus and on NATS JetStream.
+// Wire format: proto3 JSON (camelCase field names).
+// schema_version = "v1" — bump to "v2" only for breaking payload changes.
+message Envelope {
+  string event_id       = 1;  // UUID assigned by bus on publish
+  string occurred_at    = 2;  // ISO 8601
+  string schema_version = 3;  // "v1"
+  string producer       = 4;  // e.g. "services/api"
+  string subject        = 5;  // NATS-style subject: domain.entity.verb
+  uint64 seq            = 6;  // monotonic sequence from the bus ring
+
+  oneof payload {
+    TaskSyncedPayload              task_synced              = 10;
+    TipServedPayload               tip_served               = 11;
+    TipFeedbackPayload             tip_feedback             = 12;
+    TipRewardFailedPayload         tip_reward_failed        = 13;
+    IntegrationTokenExpiredPayload integration_token_expired = 14;
+  }
+}
--- a/packages/shared-types/events/oo/events/v1/integration.proto
+++ b/packages/shared-types/events/oo/events/v1/integration.proto
@@ -0,0 +1,9 @@
+syntax = "proto3";
+package oo.events.v1;
+
+// subject: signals.integration.token_expired
+message IntegrationTokenExpiredPayload {
+  string user_id     = 1;
+  string provider    = 2;
+  string detected_at = 3;  // ISO 8601
+}
--- a/packages/shared-types/events/oo/events/v1/signals.proto
+++ b/packages/shared-types/events/oo/events/v1/signals.proto
@@ -0,0 +1,39 @@
+syntax = "proto3";
+package oo.events.v1;
+
+// subject: signals.task.synced
+message TaskSyncedPayload {
+  string user_id   = 1;
+  string source    = 2;  // e.g. "todoist"
+  int32  count     = 3;
+  string synced_at = 4;  // ISO 8601
+}
+
+// subject: signals.tip.served
+message TipServedPayload {
+  string user_id   = 1;
+  string tip_id    = 2;
+  string policy    = 3;
+  string served_at = 4;  // ISO 8601
+}
+
+// subject: signals.tip.feedback
+// action: done | dismiss | snooze | helpful | not_helpful
+message TipFeedbackPayload {
+  string         user_id    = 1;
+  string         tip_id     = 2;
+  string         action     = 3;
+  double         reward     = 4;
+  optional int64 dwell_ms   = 5;  // unset when no dwell was recorded
+  string         created_at = 6;  // ISO 8601
+}
+
+// subject: signals.tip.reward_failed
+message TipRewardFailedPayload {
+  string user_id   = 1;
+  string tip_id    = 2;
+  double reward    = 3;
+  int32  attempts  = 4;
+  string error     = 5;
+  string failed_at = 6;  // ISO 8601
+}
--- a/packages/shared-types/package.json
+++ b/packages/shared-types/package.json
@@ -15,7 +15,9 @@
    "test": "vitest run",
    "test:watch": "vitest",
    "type-check": "tsc --noEmit",
-    "clean": "rm -rf dist"
+    "clean": "rm -rf dist",
+    "buf:lint": "buf lint events",
+    "buf:breaking": "buf breaking events --against '../../.git#branch=main,subdir=packages/shared-types/events'"
  },
  "devDependencies": {
    "@vitest/coverage-v8": "^4.1.4",
--- a/packages/shared-types/src/events/index.ts
+++ b/packages/shared-types/src/events/index.ts
@@ -1,6 +1,6 @@
 /**
 * NormalizedEvent — the durable envelope for all events flowing through
- * the system. Today: in-process EventEmitter. Tomorrow: NATS JetStream.
+ * the system. Mirrors oo.events.v1.Envelope in packages/shared-types/events/.
 *
 * Subject taxonomy:
 *   signals.task.synced      — Todoist (or other source) task list refreshed
@@ -10,10 +10,16 @@
 *   signals.integration.token_expired — OAuth token needs reconnect
 */
 export interface NormalizedEvent<T = unknown> {
+  /** UUID assigned by bus on publish */
+  eventId: string;
  /** NATS-style subject: domain.entity.verb */
  subject: string;
  /** ISO 8601 timestamp */
-  ts: string;
+  occurredAt: string;
+  /** "v1" — bump for breaking payload changes; see packages/shared-types/events/ */
+  schemaVersion: 'v1';
+  /** e.g. "services/api" */
+  producer: string;
  /** Monotonically increasing sequence number (in-process ring; JetStream seq in prod) */
  seq: number;
  payload: T;
--- a/packages/shared-types/tsconfig.json
+++ b/packages/shared-types/tsconfig.json
@@ -4,5 +4,6 @@
    "outDir": "dist",
    "rootDir": "src"
  },
-  "include": ["src"]
+  "include": ["src"],
+  "exclude": ["src/__tests__", "**/*.test.ts"]
 }
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
--- a/scripts/buf-check.sh
+++ b/scripts/buf-check.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+# Run buf lint and breaking-change detection locally.
+# Usage: ./scripts/buf-check.sh [against-branch]
+# Default against-branch: main
+set -euo pipefail
+
+AGAINST="${1:-main}"
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+EVENTS="$ROOT/packages/shared-types/events"
+
+if ! command -v buf &>/dev/null; then
+  echo "buf not found. Install: https://buf.build/docs/installation"
+  echo "  curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf"
+  exit 1
+fi
+
+echo "==> buf lint"
+buf lint "$EVENTS"
+
+echo "==> buf breaking against $AGAINST"
+buf breaking "$EVENTS" \
+  --against "$ROOT/.git#branch=${AGAINST},subdir=packages/shared-types/events"
+
+echo "All checks passed."
--- a/services/api/README.md
+++ b/services/api/README.md
@@ -0,0 +1,91 @@
+# services/api
+
+Express BFF that serves all client-facing routes, manages sessions, runs background signal sync, and proxies admin calls to `ml/serving`.
+
+## Contract
+
+```
+GET  /health                             { ok: true }
+
+POST /api/auth/login                     → redirect to Google OAuth
+GET  /api/auth/callback                  OAuth return URL
+POST /api/auth/logout
+GET  /api/auth/session                   → { user? }
+POST /api/auth/token                     { token } → set sid cookie (ADMIN_TOKEN auth)
+
+GET  /api/integrations                   list connected integrations
+POST /api/integrations/todoist/connect   start Todoist OAuth
+GET  /api/integrations/todoist/callback
+DELETE /api/integrations/:provider       disconnect
+
+POST /api/recommend                      → { tip }
+POST /api/tip/:id/feedback               { action } → { ok }
+
+GET  /api/user/profile
+DELETE /api/user                         account deletion
+
+POST /api/push/subscribe
+DELETE /api/push/subscribe
+
+GET  /api/admin/stats                    DAU/WAU, feedback breakdown
+GET  /api/admin/users
+GET  /api/admin/events                   recent event stream (ring buffer)
+GET  /api/admin/sim/runs                 offline sim run list
+POST /api/admin/sim/run                  launch offline sim
+GET  /api/admin/sim/runs/:id/output      tail sim stdout
+...
+
+GET  /api/ml/*                           admin-only proxy to ml/serving
+```
+
+## Middleware stack (request order)
+
+1. `cors` — origin limited to `WEB_BASE_URL`
+2. `tracingMiddleware` — reads or generates W3C `traceparent`; sets `req.traceId` + `req.traceparent`
+3. `pinoHttp` — structured JSON request/response logs with `traceId` field; `/health` suppressed
+4. `express.json()` / `cookieParser`
+5. `sessionMiddleware` — validates `sid` cookie, attaches `req.userId`
+
+## Observability
+
+Logs are structured JSON via **pino**. Every line includes `traceId` (extracted from the incoming W3C `traceparent` header, or generated fresh). The same `traceparent` is forwarded on all outbound HTTP calls to `ml/serving` so traces correlate end-to-end.
+
+Sentry error capture is active when `SENTRY_DSN` is set.
+
+## Background tasks
+
+- **Todoist sync scheduler** — runs every `TODOIST_SYNC_INTERVAL_MS` (default 15 min); starts 10 s after boot to avoid startup surge.
+- **Retention purge** — deletes `tipScores` and `tipFeedback` rows older than 30 days; runs on boot and daily.
+- **Profile TTL invalidation** — listens to `signals.task.synced` and `signals.tip.feedback` on the in-process Bus; invalidates cached user-level profile features so the next `/recommend` gets fresh values.
+
+## Config
+
+| Env var | Default | Description |
+|---------|---------|-------------|
+| `PORT` | `3001` | Listen port |
+| `NODE_ENV` | `development` | Environment label |
+| `DATABASE_PATH` | `./data/oo.db` | SQLite file |
+| `SESSION_SECRET` | required | Cookie signing secret |
+| `GOOGLE_CLIENT_ID/SECRET` | required | OAuth |
+| `TODOIST_CLIENT_ID/SECRET` | required | OAuth |
+| `API_BASE_URL` | `http://localhost:3001` | Self-referential redirect URI |
+| `WEB_BASE_URL` | `http://localhost:3000` | CORS + post-login redirect |
+| `ML_SERVING_URL` | `http://localhost:8000` | ml/serving base URL |
+| `NATS_URL` | `` | NATS broker; empty = in-process bus only |
+| `TODOIST_SYNC_INTERVAL_MS` | `900000` | Background sync cadence |
+| `TIP_PROMPT_VERSION` | `` | Prompt variant(s) for `/generate` |
+| `LOG_LEVEL` | `info` | pino log level |
+| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
+| `VAPID_*` | | Web push keys |
+| `ADMIN_TOKEN` | `` | Static token for service/Playwright admin auth; empty = disabled |
+
+## Health story
+
+`GET /health` returns `{ ok: true }`. No dependency checks — upstream deps (`ml/serving`, NATS) have their own health endpoints checked separately.
+
+## Extraction criteria
+
+Extract to its own host when:
+- Auth session management needs a dedicated Redis/PG session store, **or**
+- Background sync load (Todoist, future connectors) displaces API serving on the shared host, **or**
+- Team boundary emerges between auth/BFF and recommender orchestration.
--- a/services/api/package.json
+++ b/services/api/package.json
@@ -16,6 +16,7 @@
  },
  "dependencies": {
    "@oo/shared-types": "workspace:*",
+    "@sentry/node": "^10.50.0",
    "better-sqlite3": "^11.8.1",
    "cookie-parser": "^1.4.7",
    "cors": "^2.8.5",
@@ -27,6 +28,8 @@
    "nats": "^2.29.3",
    "node-fetch": "^3.3.2",
    "openid-client": "^6.3.4",
+    "pino": "^10.3.1",
+    "pino-http": "^11.0.0",
    "web-push": "^3.6.7",
    "zod": "^3.24.1"
  },
--- a/services/api/src/config.ts
+++ b/services/api/src/config.ts
@@ -34,6 +34,17 @@ export const config = {
  ML_SERVING_URL: optional('ML_SERVING_URL', 'http://localhost:8000'),
  LITELLM_URL: optional('LITELLM_URL', 'http://localhost:4000'),

+  MLFLOW_URL: optional('MLFLOW_URL', 'http://localhost:5000'),
+  AIRFLOW_URL: optional('AIRFLOW_URL', 'http://localhost:8080'),
+  AIRFLOW_API_USER: optional('AIRFLOW_API_USER', 'admin'),
+  AIRFLOW_API_PASSWORD: optional('AIRFLOW_API_PASSWORD', 'admin'),
+
+  /** Shared secret for internal Airflow→API callbacks. */
+  INTERNAL_API_TOKEN: optional('INTERNAL_API_TOKEN', ''),
+
+  /** Static token for automated/service access to the admin panel (e.g. Playwright tests). */
+  ADMIN_TOKEN: optional('ADMIN_TOKEN', ''),
+
  VAPID_PUBLIC_KEY: optional('VAPID_PUBLIC_KEY', ''),
  VAPID_PRIVATE_KEY: optional('VAPID_PRIVATE_KEY', ''),
  VAPID_SUBJECT: optional('VAPID_SUBJECT', 'mailto:admin@localhost'),
--- a/services/api/src/db/index.ts
+++ b/services/api/src/db/index.ts
@@ -156,6 +156,10 @@ export function runMigrations() {
    `ALTER TABLE tip_scores ADD COLUMN prompt_version TEXT`,
    `ALTER TABLE tip_scores ADD COLUMN llm_model TEXT`,
    `ALTER TABLE tip_scores ADD COLUMN tip_kind TEXT`,
+    `ALTER TABLE sim_runs ADD COLUMN airflow_dag_run_id TEXT`,
+    `ALTER TABLE sim_runs ADD COLUMN mlflow_run_id TEXT`,
+    `ALTER TABLE sim_runs ADD COLUMN judge_mode TEXT NOT NULL DEFAULT 'rule'`,
+    `ALTER TABLE sim_runs ADD COLUMN n_policies INTEGER NOT NULL DEFAULT 2`,
  ]) {
    try { sqlite.exec(stmt); } catch { /* column already exists */ }
  }
--- a/services/api/src/db/schema.ts
+++ b/services/api/src/db/schema.ts
@@ -112,9 +112,13 @@ export const simRuns = sqliteTable('sim_runs', {
  tasksPerRound: integer('tasks_per_round').notNull().default(8),
  useLlm: integer('use_llm', { mode: 'boolean' }).notNull().default(false),
  status: text('status').notNull().default('pending'),  // 'pending'|'running'|'done'|'failed'
+  judgeMode: text('judge_mode').notNull().default('rule'),
+  nPolicies: integer('n_policies').notNull().default(2),
  summaryJson: text('summary_json'),           // JSON: { [policy]: PolicySummary }
  winner: text('winner'),
  personaBreakdownJson: text('persona_breakdown_json'), // JSON: { [persona]: { [policy]: {reward,n} } }
+  airflowDagRunId: text('airflow_dag_run_id'),
+  mlflowRunId: text('mlflow_run_id'),
  createdAt: text('created_at').notNull(),
  finishedAt: text('finished_at'),
 });
--- a/services/api/src/events/tests/bus.test.ts
+++ b/services/api/src/events/tests/bus.test.ts
@@ -56,7 +56,7 @@ describe('EventBus — delivery', () => {
  it('does not throw when publishing with no subscribers', () => {
    const b = makeBus();
    expect(() =>
-      b.publish('signals.task.synced', { userId: 'u', count: 3, syncedAt: '' }),
+      b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 3, syncedAt: '' }),
    ).not.toThrow();
  });

@@ -101,7 +101,7 @@ describe('EventBus — ring buffer / tail()', () => {
  it('tail() filters by subject prefix', () => {
    const b = makeBus();
    b.publish('signals.tip.served', { userId: 'u', tipId: 't', policy: 'p', servedAt: '' });
-    b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' });
+    b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });

    const tipEvents = b.tail({ subject: 'signals.tip' });
    expect(tipEvents.every((e) => e.subject.startsWith('signals.tip'))).toBe(true);
@@ -178,7 +178,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
    const hook = vi.fn();
    b.onPublish(hook);

-    const payload = { userId: 'u', count: 2, syncedAt: 'now' };
+    const payload = { userId: 'u', source: 'todoist', count: 2, syncedAt: 'now' };
    b.publish('signals.task.synced', payload);

    expect(hook).toHaveBeenCalledOnce();
@@ -191,7 +191,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
    b.onPublish(() => calls.push('a'));
    b.onPublish(() => calls.push('b'));

-    b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' });
+    b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
    expect(calls).toEqual(['a', 'b']);
  });

@@ -202,7 +202,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
    b.onPublish(hook);
    b.subscribe('signals.task.synced', sub);

-    b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' });
+    b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
    expect(hook).toHaveBeenCalledOnce();
    expect(sub).toHaveBeenCalledOnce();
  });
@@ -215,7 +215,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
      throw new Error('boom');
    });
    expect(() =>
-      b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }),
+      b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
    ).toThrow('boom');
  });
 });
--- a/services/api/src/events/tests/nats.test.ts
+++ b/services/api/src/events/tests/nats.test.ts
@@ -106,7 +106,7 @@ describe('connectNats — bridge bus → JetStream', () => {

    await connectNats('nats://test:4222');

-    const payload = { userId: 'u1', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
+    const payload = { userId: 'u1', source: 'todoist', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
    bus.publish('signals.task.synced', payload);

    // Allow the queued microtask in the hook to flush.
@@ -121,16 +121,17 @@ describe('connectNats — bridge bus → JetStream', () => {

  it('swallows JetStream publish errors so the in-process bus keeps working', async () => {
    const { connectNats } = await import('../nats.js');
+    const { logger } = await import('../../logger.js');
    const { bus } = await import('../bus.js');

    await connectNats('nats://test:4222');

    // Force the next js.publish to reject.
    lastJsPublish.mockRejectedValueOnce(new Error('jetstream down'));
-    const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
+    const errSpy = vi.spyOn(logger, 'error');

    expect(() =>
-      bus.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }),
+      bus.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
    ).not.toThrow();

    // Wait a tick for the rejected promise's catch to run.
@@ -142,12 +143,16 @@ describe('connectNats — bridge bus → JetStream', () => {
 describe('connectNats — failure mode', () => {
  it('logs a warning and stays silent when connect rejects', async () => {
    const { connectNats } = await import('../nats.js');
+    const { logger } = await import('../../logger.js');

    lastConnect.mockRejectedValueOnce(new Error('ECONNREFUSED'));
-    const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});
+    const warnSpy = vi.spyOn(logger, 'warn');

    await expect(connectNats('nats://nope:4222')).resolves.toBeUndefined();
-    expect(warnSpy).toHaveBeenCalledWith(expect.stringContaining('connection failed'));
+    expect(warnSpy).toHaveBeenCalledWith(
+      expect.objectContaining({ err: expect.anything() }),
+      expect.stringContaining('connection failed'),
+    );
  });
 });

@@ -156,7 +161,7 @@ describe('Bus.onPublish contract — used by NATS bridge', () => {
    const b = new Bus();
    const hook = vi.fn();
    b.onPublish(hook);
-    b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' });
+    b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
    expect(hook).toHaveBeenCalledOnce();
  });
 });
--- a/services/api/src/events/bus.ts
+++ b/services/api/src/events/bus.ts
@@ -45,6 +45,7 @@ export type RewardDeliveryFailedEvent = {

 export type TaskSyncedEvent = {
  userId: string;
+  source: string;   // e.g. 'todoist'
  count: number;
  syncedAt: string;
 };
--- a/services/api/src/events/nats.ts
+++ b/services/api/src/events/nats.ts
@@ -12,6 +12,7 @@

 import type { NatsConnection, JetStreamClient, StreamConfig } from 'nats';
 import { bus } from './bus.js';
+import { logger } from '../logger.js';

 let nc: NatsConnection | null = null;
 let js: JetStreamClient | null = null;
@@ -67,13 +68,13 @@ export async function connectNats(natsUrl: string): Promise<void> {
      if (!js) return;
      const data = new TextEncoder().encode(JSON.stringify(payload));
      js.publish(subject, data).catch((err: Error) =>
-        console.error(`[nats] publish failed for ${subject}: ${err.message}`),
+        logger.error({ err, subject }, 'nats publish failed'),
      );
    });

-    console.log(`[nats] connected to ${natsUrl}, streams: ${STREAMS.map((s) => s.name).join(', ')}`);
+    logger.info({ url: natsUrl, streams: STREAMS.map((s) => s.name) }, 'nats connected');
  } catch (err: any) {
-    console.warn(`[nats] connection failed — running without JetStream: ${err.message}`);
+    logger.warn({ err }, 'nats connection failed — running without JetStream');
  }
 }

--- a/services/api/src/index.ts
+++ b/services/api/src/index.ts
@@ -1,7 +1,10 @@
 import 'dotenv/config';
+import { logger } from './logger.js';
 import express from 'express';
+import { pinoHttp } from 'pino-http';
 import cookieParser from 'cookie-parser';
 import cors from 'cors';
+import { tracingMiddleware } from './middleware/tracing.js';
 import { config } from './config.js';
 import { db, runMigrations } from './db/index.js';
 import { tipScores, tipFeedback } from './db/schema.js';
@@ -12,7 +15,7 @@ import { integrationsRouter } from './routes/integrations.js';
 import { recommenderRouter } from './routes/recommender.js';
 import { userRouter } from './routes/user.js';
 import { pushRouter } from './routes/push.js';
-import { adminRouter } from './routes/admin.js';
+import { adminRouter, adminInternalRouter } from './routes/admin.js';
 import { mkdir } from 'fs/promises';
 import { dirname } from 'path';
 import { requireAuth } from './middleware/session.js';
@@ -26,13 +29,11 @@ import { registerProfileSubscriptions } from './profile/subscriber.js';
 await mkdir(dirname(config.DATABASE_PATH), { recursive: true });
 runMigrations();

-// Keep the API alive on stray async faults (e.g. a single bad admin route)
-// rather than dropping the whole process.
 process.on('unhandledRejection', (reason) => {
-  console.error('[api] unhandledRejection', reason);
+  logger.error({ err: reason }, 'unhandledRejection');
 });
 process.on('uncaughtException', (err) => {
-  console.error('[api] uncaughtException', err);
+  logger.fatal({ err }, 'uncaughtException');
 });

 const app = express();
@@ -43,6 +44,15 @@ app.use(
    credentials: true,
  }),
 );
+app.use(tracingMiddleware);
+app.use(
+  pinoHttp({
+    logger,
+    genReqId: (req) => req.traceId,
+    customProps: (req) => ({ traceId: req.traceId }),
+    autoLogging: { ignore: (req) => req.url === '/health' },
+  }),
+);
 app.use(express.json());
 app.use(cookieParser());
 app.use(sessionMiddleware);
@@ -55,17 +65,15 @@ app.use('/api', recommenderRouter);
 app.use('/api/user', userRouter);
 app.use('/api/push', pushRouter);
+app.use('/api/admin', adminInternalRouter);
 app.use('/api/admin', adminRouter);

-// Proxy ml/serving endpoints through the API (admin-only).
-// Allows admin UI to call /api/ml/stats/:userId, /api/ml/features/:userId
-// without needing direct access to the ml/serving port.
 app.use('/api/ml', requireAuth as any, requireAdmin as any, async (req: Request, res: Response) => {
  const mlUrl = config.ML_SERVING_URL;
  const target = `${mlUrl}${req.path}`;
  try {
    const upstream = await fetch(target, {
      method: req.method,
-      headers: { 'Content-Type': 'application/json' },
+      headers: { 'Content-Type': 'application/json', traceparent: req.traceparent },
      body: req.method !== 'GET' ? JSON.stringify(req.body) : undefined,
      signal: AbortSignal.timeout(5000),
    });
@@ -82,7 +90,7 @@ async function purgeExpiredData() {
    await db.delete(tipScores).where(lt(tipScores.servedAt, cutoff));
    await db.delete(tipFeedback).where(lt(tipFeedback.createdAt, cutoff));
  } catch (err: any) {
-    console.error(`[purge] retention cleanup failed: ${err.message}`);
+    logger.error({ err }, 'retention cleanup failed');
  }
 }

@@ -90,7 +98,7 @@ purgeExpiredData();
 setInterval(purgeExpiredData, 24 * 60 * 60 * 1000);

 app.listen(config.PORT, () => {
-  console.log(`oO API listening on http://localhost:${config.PORT}`);
+  logger.info({ port: config.PORT }, 'oO API listening');
 });

 if (config.NATS_URL) {
--- a/services/api/src/logger.ts
+++ b/services/api/src/logger.ts
@@ -0,0 +1,12 @@
+import pino from 'pino';
+import * as Sentry from '@sentry/node';
+
+if (process.env['SENTRY_DSN']) {
+  Sentry.init({
+    dsn: process.env['SENTRY_DSN'],
+    environment: process.env['NODE_ENV'] ?? 'development',
+  });
+}
+
+export const logger = pino({ level: process.env['LOG_LEVEL'] ?? 'info' });
+export { Sentry };
--- a/services/api/src/middleware/tracing.ts
+++ b/services/api/src/middleware/tracing.ts
@@ -0,0 +1,26 @@
+import { randomBytes } from 'crypto';
+import type { Request, Response, NextFunction } from 'express';
+
+declare global {
+  namespace Express {
+    interface Request {
+      traceId: string;
+      traceparent: string;
+    }
+  }
+}
+
+export function tracingMiddleware(req: Request, _res: Response, next: NextFunction): void {
+  const incoming = req.headers['traceparent'] as string | undefined;
+  let traceId: string;
+  if (incoming) {
+    const parts = incoming.split('-');
+    traceId = parts.length === 4 && parts[1]?.length === 32 ? parts[1] : randomBytes(16).toString('hex');
+  } else {
+    traceId = randomBytes(16).toString('hex');
+  }
+  const parentId = randomBytes(8).toString('hex');
+  req.traceId = traceId;
+  req.traceparent = `00-${traceId}-${parentId}-01`;
+  next();
+}
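Review note on `tracingMiddleware`: the check above accepts any four-part header whose second field is 32 characters long, so non-hex or all-zero trace IDs would be inherited as-is. A stricter parse could look like the sketch below. It assumes that only version `00` headers are inherited and that anything else starts a fresh trace; `nextTraceparent` and `TRACEPARENT_RE` are hypothetical names, not part of this commit.

```typescript
import { randomBytes } from 'node:crypto';

// W3C Trace Context layout for version 00:
//   version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: inherit the caller's trace-id when the header is
// well-formed, otherwise start a new trace. A fresh span id is always minted,
// and the sampled flag is fixed to 01 to mirror the middleware in this diff.
export function nextTraceparent(incoming?: string): { traceId: string; traceparent: string } {
  const match = incoming ? TRACEPARENT_RE.exec(incoming.trim()) : null;
  const candidate = match?.[1];
  // An all-zero trace-id is invalid per the spec, so it is not inherited.
  const inherited = candidate && !/^0+$/.test(candidate) ? candidate : undefined;
  const traceId = inherited ?? randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex');
  return { traceId, traceparent: `00-${traceId}-${spanId}-01` };
}
```

Inside the middleware this would replace the `split('-')` branch, with `req.traceId` and `req.traceparent` set from the returned object.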
--- a/services/api/src/routes/tests/admin.test.ts
+++ b/services/api/src/routes/tests/admin.test.ts
@@ -4,7 +4,7 @@
 * A real Express app + in-memory SQLite DB per test suite.
 * Auth and admin middleware are mocked so we can focus on route logic.
 */
-import { describe, it, expect, vi, beforeAll } from 'vitest';
+import { describe, it, expect, vi, beforeAll, afterEach } from 'vitest';
 import express from 'express';
 import * as http from 'http';
 import { makeTestDb } from '../../test/db.js';
@@ -385,16 +385,126 @@ describe('GET /api/admin/events', () => {
  });
 });

+// ---------------------------------------------------------------------------
+// Health endpoint — mock fetch so tests don't depend on running services.
+// ---------------------------------------------------------------------------
 describe('GET /api/admin/health', () => {
-  it('returns 200 with ok, services array, and checkedAt', async () => {
+  const EXPECTED_HTTP_SERVICES = ['api', 'ml-serving', 'mlflow', 'airflow'] as const;
+  const EXPECTED_INTERNAL = ['sqlite', 'event-bus'] as const;
+  const VALID_STATUSES = new Set(['ok', 'degraded', 'down']);
+
+  type ServiceRow = { name: string; status: string; latencyMs: number };
+  type HealthBody = { ok: boolean; services: ServiceRow[]; checkedAt: string };
+
+  function mockFetch(upServices: Set<string>) {
+    // Resolve service name by port (matches defaults in config.ts).
+    // Up services return HTTP 200; absent ones throw (simulates connection refused → 'down').
+    vi.stubGlobal('fetch', async (url: string) => {
+      const s = String(url);
+      let name: string;
+      if (s.includes(':8000'))      name = 'ml-serving';
+      else if (s.includes(':5000')) name = 'mlflow';
+      else if (s.includes(':8080')) name = 'airflow';
+      else                          name = 'api';
+
+      if (!upServices.has(name)) throw new Error(`ECONNREFUSED ${name}`);
+      return { ok: true, json: async () => ({ ok: true, status: 'healthy' }) };
+    });
+  }
+
+  afterEach(() => vi.unstubAllGlobals());
+
+  it('shape: 200, typed fields, all expected services present', async () => {
+    mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
    const { server, call } = await startServer(buildApp());
    try {
      const { status, body } = await call('GET', '/api/admin/health');
-      const b = body as { ok: boolean; services: { name: string; status: string }[]; checkedAt: string };
+      const b = body as HealthBody;
      expect(status).toBe(200);
      expect(typeof b.ok).toBe('boolean');
      expect(Array.isArray(b.services)).toBe(true);
      expect(typeof b.checkedAt).toBe('string');
+      expect(new Date(b.checkedAt).getTime()).toBeGreaterThan(0);
+
+      const names = b.services.map((s) => s.name);
+      for (const svc of [...EXPECTED_HTTP_SERVICES, ...EXPECTED_INTERNAL]) {
+        expect(names).toContain(svc);
+      }
+      for (const svc of b.services) {
+        expect(VALID_STATUSES).toContain(svc.status);
+        expect(typeof svc.latencyMs).toBe('number');
+      }
+    } finally {
+      server.close();
+    }
+  });
+
+  it('ok=true when all HTTP services respond 200', async () => {
+    mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
+    const { server, call } = await startServer(buildApp());
+    try {
+      const { body } = await call('GET', '/api/admin/health');
+      const b = body as HealthBody;
+      for (const name of EXPECTED_HTTP_SERVICES) {
+        const svc = b.services.find((s) => s.name === name);
+        expect(svc?.status, `${name} should be ok`).toBe('ok');
+      }
+      expect(b.ok).toBe(true);
+    } finally {
+      server.close();
+    }
+  });
+
+  it('ml-serving=down and ok=false when ml-serving is unreachable', async () => {
+    mockFetch(new Set(['api', 'mlflow', 'airflow'])); // ml-serving absent
+    const { server, call } = await startServer(buildApp());
+    try {
+      const { body } = await call('GET', '/api/admin/health');
+      const b = body as HealthBody;
+      const mlSvc = b.services.find((s) => s.name === 'ml-serving');
+      expect(mlSvc?.status).toBe('down');
+      expect(b.ok).toBe(false);
+    } finally {
+      server.close();
+    }
+  });
+
+  it('airflow=down and ok=false when airflow is unreachable', async () => {
+    mockFetch(new Set(['api', 'ml-serving', 'mlflow'])); // airflow absent
+    const { server, call } = await startServer(buildApp());
+    try {
+      const { body } = await call('GET', '/api/admin/health');
+      const b = body as HealthBody;
+      const svc = b.services.find((s) => s.name === 'airflow');
+      expect(svc?.status).toBe('down');
+      expect(b.ok).toBe(false);
+    } finally {
+      server.close();
+    }
+  });
+
+  it('mlflow=down and ok=false when mlflow is unreachable', async () => {
+    mockFetch(new Set(['api', 'ml-serving', 'airflow'])); // mlflow absent
+    const { server, call } = await startServer(buildApp());
+    try {
+      const { body } = await call('GET', '/api/admin/health');
+      const b = body as HealthBody;
+      const svc = b.services.find((s) => s.name === 'mlflow');
+      expect(svc?.status).toBe('down');
+      expect(b.ok).toBe(false);
+    } finally {
+      server.close();
+    }
+  });
+
+  it('sqlite and event-bus are always present regardless of HTTP service status', async () => {
+    mockFetch(new Set()); // all HTTP services down
+    const { server, call } = await startServer(buildApp());
+    try {
+      const { body } = await call('GET', '/api/admin/health');
+      const b = body as HealthBody;
+      expect(b.services.find((s) => s.name === 'sqlite')?.status).toBe('ok');
+      expect(b.services.find((s) => s.name === 'event-bus')?.status).toBe('ok');
    } finally {
      server.close();
    }
--- a/services/api/src/routes/admin.ts
+++ b/services/api/src/routes/admin.ts
@@ -1,4 +1,5 @@
-import { type Router as ExpressRouter, Router, Response } from 'express';
+import { type Router as ExpressRouter, Router, Response, type Request } from 'express';
+import { logger } from '../logger.js';
 import { db, rawSqlite } from '../db/index.js';
 import {
  users,
@@ -523,16 +524,24 @@ router.get('/data-quality', async (req: AuthenticatedRequest, res: Response) =>
 // Fan-out to all subsystem /health endpoints.
 // ---------------------------------------------------------------------------
 router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
-  const checks: Array<{ name: string; url: string }> = [
-    { name: 'api', url: `http://localhost:${process.env.PORT ?? 3001}/health` },
+  const airflowAuth = Buffer.from(`${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`).toString('base64');
+
+  const checks: Array<{ name: string; url: string; headers?: Record<string, string> }> = [
+    { name: 'api',        url: `http://localhost:${config.PORT}/health` },
    { name: 'ml-serving', url: `${config.ML_SERVING_URL}/health` },
+    { name: 'mlflow',     url: `${config.MLFLOW_URL}/health` },
+    { name: 'airflow',    url: `${config.AIRFLOW_URL}/api/v1/health`,
+      headers: { Authorization: `Basic ${airflowAuth}` } },
  ];

  const results = await Promise.allSettled(
-    checks.map(async ({ name, url }) => {
+    checks.map(async ({ name, url, headers }) => {
      const t0 = Date.now();
      try {
-        const r = await fetch(url, { signal: AbortSignal.timeout(3000) });
+        const r = await fetch(url, {
+          headers,
+          signal: AbortSignal.timeout(3000),
+        });
        return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
      } catch {
        return { name, status: 'down', latencyMs: Date.now() - t0 };
@@ -548,15 +557,12 @@ router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
    dbStatus = 'down';
  }

-  // Event bus: always ok if process is alive
-  const eventBusStatus = 'ok';
-
  const services = results.map((r) =>
    r.status === 'fulfilled' ? r.value : { name: 'unknown', status: 'down', latencyMs: 0 },
  );

  services.push({ name: 'sqlite',    status: dbStatus, latencyMs: 0 });
-  services.push({ name: 'event-bus', status: eventBusStatus, latencyMs: 0 });
+  services.push({ name: 'event-bus', status: 'ok',     latencyMs: 0 });

  const allOk = services.every((s) => s.status === 'ok');
  res.json({ ok: allOk, services, checkedAt: new Date().toISOString() });
@@ -699,22 +705,21 @@ router.delete('/saved-queries/:id', async (req: AuthenticatedRequest, res: Respo

 // ---------------------------------------------------------------------------
 // POST /api/admin/simulate/start
-// Spawn ml/experiments/sim/runner.py in the background; return run_id.
+// Trigger an Airflow DAG run (bandit_sim). Falls back to a local subprocess
+// when AIRFLOW_URL is not reachable, so local dev still works.
 // ---------------------------------------------------------------------------
 router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response) => {
  const {
    nUsers = 5,
    nRounds = 20,
    tasksPerRound = 8,
-    useLlm = false,
    judgeMode = 'rule',
    policies = ['linucb-v1', 'egreedy-v1'],
  } = req.body as {
    nUsers?: number;
    nRounds?: number;
    tasksPerRound?: number;
-    useLlm?: boolean;
-    judgeMode?: 'rule' | 'llm' | 'claude-code';
+    judgeMode?: 'rule' | 'llm';
    policies?: string[];
  };

@@ -733,17 +738,69 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
    nUsers,
    nRounds,
    tasksPerRound,
-    useLlm,
+    useLlm: judgeMode === 'llm',
+    judgeMode,
+    nPolicies: policies.length,
    status: 'running',
    createdAt: now,
  });

+  // ── Try Airflow first ────────────────────────────────────────────────────
+  if (config.AIRFLOW_URL && config.INTERNAL_API_TOKEN) {
+    try {
+      const airflowAuth = Buffer.from(
+        `${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`,
+      ).toString('base64');
+
+      const dagRes = await fetch(
+        `${config.AIRFLOW_URL}/api/v1/dags/bandit_sim/dagRuns`,
+        {
+          method: 'POST',
+          headers: {
+            'Content-Type': 'application/json',
+            Authorization: `Basic ${airflowAuth}`,
+          },
+          body: JSON.stringify({
+            conf: {
+              sim_run_id: id,
+              n_users: nUsers,
+              n_rounds: nRounds,
+              tasks_per_round: tasksPerRound,
+              policies,
+              judge_mode: judgeMode,
+              ml_url: config.ML_SERVING_URL,
+              mlflow_url: config.MLFLOW_URL,
+              callback_url: `${config.API_BASE_URL}/api/admin/simulate/${id}/complete`,
+              internal_token: config.INTERNAL_API_TOKEN,
+            },
+          }),
+          signal: AbortSignal.timeout(5000),
+        },
+      );
+
+      if (dagRes.ok) {
+        const dagBody = await dagRes.json() as { dag_run_id: string };
+        await db
+          .update(simRuns)
+          .set({ airflowDagRunId: dagBody.dag_run_id })
+          .where(eq(simRuns.id, id));
+
+        res.json({ id, status: 'running', airflow_dag_run_id: dagBody.dag_run_id });
+        return;
+      }
+      logger.warn({ status: dagRes.status }, 'sim: Airflow trigger failed, falling back to subprocess');
+    } catch (err) {
+      logger.warn({ err }, 'sim: Airflow unreachable, falling back to subprocess');
+    }
+  }
+
+  // ── Subprocess fallback (local dev / Airflow not configured) ────────────
  const runnerPath = resolve(__dirname, '../../../../ml/experiments/sim/runner.py');
  const venvPython = resolve(__dirname, '../../../../ml/serving/.venv/bin/python');
  const pythonBin = existsSync(venvPython) ? venvPython : 'python3';
  const outPath = `/tmp/oo-sim-${id}.json`;

-  const args = [
+  const child = spawn(pythonBin, [
    runnerPath,
    '--n-users', String(nUsers),
    '--n-rounds', String(nRounds),
@@ -751,32 +808,22 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
    '--ml-url', config.ML_SERVING_URL,
    '--policies', ...policies,
    '--out', outPath,
-    '--judge', judgeMode === 'llm' ? 'llm' : judgeMode === 'claude-code' ? 'rule' : 'rule',
-    // claude-code mode isn't auto-runnable from the API (requires human in the loop)
-    // it falls back to rule judge when triggered from the panel
-  ];
+    '--judge', judgeMode,
+    '--mlflow-url', config.MLFLOW_URL,
+    '--mlflow-experiment', 'bandit_simulation',
+  ], { stdio: ['ignore', 'pipe', 'pipe'] });

-  const child = spawn(pythonBin, args, { stdio: ['ignore', 'pipe', 'pipe'] });
+  if (child.pid) _simProcesses.set(id, { pid: child.pid, startedAt: now });

-  if (child.pid) {
-    _simProcesses.set(id, { pid: child.pid, startedAt: now });
-  }
-
-  // Without this listener, a spawn failure (ENOENT when python3 is absent
-  // — e.g. in the alpine api container) would emit an unhandled 'error' event
-  // and crash the whole API process.
  child.on('error', async (err) => {
-    console.error('[sim] spawn error', err);
+    logger.error({ err }, 'sim: spawn error');
    _simProcesses.delete(id);
-    await db
-      .update(simRuns)
+    await db.update(simRuns)
      .set({ status: 'failed', finishedAt: new Date().toISOString() })
      .where(eq(simRuns.id, id));
  });

-  // Capture stderr for debugging
-  const stderrLines: string[] = [];
-  child.stderr?.on('data', (d: Buffer) => stderrLines.push(d.toString()));
+  child.stderr?.on('data', (d: Buffer) => logger.debug({ stderr: d.toString() }, 'sim stderr'));

  child.on('exit', async (code) => {
    _simProcesses.delete(id);
@@ -785,8 +832,6 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
    if (code === 0 && existsSync(outPath)) {
      try {
        const raw = JSON.parse(readFileSync(outPath, 'utf-8'));
-
-        // Bulk-insert sim events
        const eventRows = (raw.events ?? []).map((ev: Record<string, unknown>) => ({
          id: nanoid(),
          runId: id,
@@ -804,21 +849,19 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
          dayOfWeek: Number(ev.day_of_week),
          createdAt: now,
        }));
-
        for (const row of eventRows) {
          await db.insert(simEvents).values(row).catch(() => {});
        }
-
        await db.update(simRuns).set({
          status: 'done',
          summaryJson: JSON.stringify(raw.summary),
          winner: raw.winner,
          personaBreakdownJson: JSON.stringify(raw.persona_breakdown),
+          mlflowRunId: raw.mlflow_run_id ?? null,
          finishedAt,
        }).where(eq(simRuns.id, id));
-
        try { unlinkSync(outPath); } catch { /* ignore */ }
-      } catch (e) {
+      } catch {
        await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
      }
    } else {
@@ -863,4 +906,68 @@ router.get('/simulate/:id', async (req: AuthenticatedRequest, res: Response) =>
  res.json({ run: { ...run, isRunning }, events });
 });

-export { router as adminRouter };
+// ---------------------------------------------------------------------------
+// internalRouter: no session auth; requests are checked only against INTERNAL_API_TOKEN (x-internal-token header).
+// Mounted separately in index.ts at /api/admin so the admin router's router.use() auth does not apply to it.
+// ---------------------------------------------------------------------------
+const internalRouter: ExpressRouter = Router();
+
+internalRouter.post('/simulate/:id/complete', async (req: Request, res: Response) => {
+  const token = req.headers['x-internal-token'];
+  if (!config.INTERNAL_API_TOKEN || token !== config.INTERNAL_API_TOKEN) {
+    res.status(401).json({ error: 'Unauthorized' });
+    return;
+  }
+
+  const { id } = req.params as { id: string };
+  const { summary, winner, persona_breakdown, events: rawEvents, mlflow_run_id } =
+    req.body as {
+      summary: Record<string, unknown>;
+      winner: string;
+      persona_breakdown: Record<string, unknown>;
+      events: Record<string, unknown>[];
+      mlflow_run_id?: string;
+    };
+
+  const finishedAt = new Date().toISOString();
+  const now = finishedAt;
+
+  try {
+    const eventRows = (rawEvents ?? []).map((ev) => ({
+      id: nanoid(),
+      runId: id,
+      round: Number(ev['round']),
+      userId: String(ev['user_id']),
+      persona: String(ev['persona']),
+      policy: String(ev['policy']),
+      tipContent: String(ev['tip_content']),
+      priority: Number(ev['priority']),
+      isOverdue: Boolean(ev['is_overdue']),
+      action: String(ev['action']),
+      dwellMs: ev['dwell_ms'] != null ? Number(ev['dwell_ms']) : null,
+      rewardMilli: Math.round(Number(ev['reward']) * 1000),
+      hour: Number(ev['hour']),
+      dayOfWeek: Number(ev['day_of_week']),
+      createdAt: now,
+    }));
+    for (const row of eventRows) {
+      await db.insert(simEvents).values(row).catch(() => {});
+    }
+    await db.update(simRuns).set({
+      status: 'done',
+      summaryJson: JSON.stringify(summary),
+      winner,
+      personaBreakdownJson: JSON.stringify(persona_breakdown),
+      mlflowRunId: mlflow_run_id ?? null,
+      finishedAt,
+    }).where(eq(simRuns.id, id));
+
+    res.json({ ok: true });
+  } catch (err) {
+    logger.error({ err }, 'sim: complete callback failed');
+    await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
+    res.status(500).json({ error: 'Failed to store results' });
+  }
+});
+
+export { router as adminRouter, internalRouter as adminInternalRouter };
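Review note on the `x-internal-token` check in `/simulate/:id/complete`: the handler compares the header to `config.INTERNAL_API_TOKEN` with `!==`, which can return early on the first differing character. If a constant-time comparison is wanted, something like the sketch below may fit. `tokensMatch` is a hypothetical helper, not part of this commit. Both values are hashed first only so that `timingSafeEqual` receives buffers of equal length.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto';

// Hypothetical helper: constant-time comparison of a presented token against
// the configured one. Returns false when no token is configured or when the
// header is missing or not a plain string (e.g. a repeated header array).
export function tokensMatch(presented: unknown, expected: string | undefined): boolean {
  if (!expected || typeof presented !== 'string') return false;
  // SHA-256 digests are always 32 bytes, which satisfies timingSafeEqual's
  // requirement that both inputs have the same byte length.
  const a = createHash('sha256').update(presented).digest();
  const b = createHash('sha256').update(expected).digest();
  return timingSafeEqual(a, b);
}
```

The guard in the handler would then read `if (!tokensMatch(req.headers['x-internal-token'], config.INTERNAL_API_TOKEN))`, keeping the existing 401 response.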
--- a/services/api/src/routes/auth.ts
+++ b/services/api/src/routes/auth.ts
@@ -5,6 +5,7 @@ import { db } from '../db/index.js';
 import { users, sessions } from '../db/schema.js';
 import { eq } from 'drizzle-orm';
 import { config } from '../config.js';
+import { logger } from '../logger.js';

 const router: ExpressRouter = Router();

@@ -36,7 +37,7 @@ router.get('/login', async (req: Request, res: Response) => {
  setTimeout(() => pendingStates.delete(state), 10 * 60 * 1000);

  const redirectUri = `${config.API_BASE_URL}/api/auth/callback`;
-  console.log('[auth] redirect_uri sent to Google:', redirectUri);
+  logger.info({ redirectUri }, 'auth: redirect_uri');
  const authUrl = client.buildAuthorizationUrl(cfg, {
    redirect_uri: redirectUri,
    scope: 'openid email profile',
@@ -72,7 +73,7 @@ router.get('/callback', async (req: Request, res: Response) => {
      expectedState: state,
    });
  } catch (err) {
-    console.error('OAuth callback error', err);
+    logger.error({ err }, 'auth: OAuth callback error');
    res.status(400).json({ error: 'OAuth error' });
    return;
  }
@@ -123,6 +124,45 @@ router.get('/callback', async (req: Request, res: Response) => {
    .redirect(`${config.WEB_BASE_URL}${pending.redirectTo}`);
 });

+/**
+ * POST /api/auth/token
+ * Exchange the static ADMIN_TOKEN for a session cookie.
+ * Finds the first admin user in the DB; rejects if ADMIN_TOKEN is not configured.
+ */
+router.post('/token', async (req: Request, res: Response) => {
+  const { token } = req.body as { token?: string };
+  if (!config.ADMIN_TOKEN || !token || token !== config.ADMIN_TOKEN) {
+    res.status(401).json({ error: 'Invalid token' });
+    return;
+  }
+
+  const [adminUser] = await db
+    .select()
+    .from(users)
+    .where(eq(users.role, 'admin'))
+    .limit(1);
+
+  if (!adminUser) {
+    res.status(403).json({ error: 'No admin user exists' });
+    return;
+  }
+
+  const sid = nanoid(32);
+  const now = new Date().toISOString();
+  const expiresAt = new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString();
+  await db.insert(sessions).values({ id: sid, userId: adminUser.id, expiresAt, createdAt: now });
+
+  res
+    .cookie('sid', sid, {
+      httpOnly: true,
+      secure: config.NODE_ENV === 'production',
+      sameSite: 'lax',
+      expires: new Date(expiresAt),
+      path: '/',
+    })
+    .json({ ok: true });
+});
+
 /** POST /api/auth/logout */
 router.post('/logout', async (req: Request, res: Response) => {
  const sid = req.cookies?.sid as string | undefined;
--- a/services/api/src/routes/recommender.ts
+++ b/services/api/src/routes/recommender.ts
@@ -1,5 +1,6 @@
 import { type Router as ExpressRouter, Router, Response } from 'express';
 import { nanoid } from 'nanoid';
+import { logger } from '../logger.js';
 import { db } from '../db/index.js';
 import { integrationTokens, tipFeedback, tipViews, tipScores } from '../db/schema.js';
 import { eq, and, desc } from 'drizzle-orm';
@@ -47,7 +48,8 @@ export const _clearCandidateCacheForTests = () => {
 // Shadow-policy registry
 // ---------------------------------------------------------------------------
 const shadowPolicies = new Map<string, { active: boolean }>([
-  // egreedy-v2 (D=12, profile features) — disabled until sim gate per ADR-0012
+  // egreedy-v2 promoted to active policy (ADR-0012). Shadow entry kept for
+  // rollback toggle; leave disabled in normal operation.
  ['egreedy-v2-shadow', { active: false }],
 ]);

@@ -84,6 +86,7 @@ async function remotePolicy(
  userId: string,
  tasks: TipCandidate[],
  profile: Profile,
+  traceparent?: string,
 ): Promise<{ tipId: string; score: number; policy: string } | null> {
  const hour = new Date().getHours();
  const dayOfWeek = new Date().getDay();
@@ -101,17 +104,16 @@ async function remotePolicy(
    profile_features: profile,
  };

-  // Active policy: egreedy-v1 (selected over linucb-v1 after offline sim — ADR-0007)
  try {
-    const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy`, {
+    const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy/v2`, {
      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
+      headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
      body: JSON.stringify(body),
      signal: AbortSignal.timeout(3000),
    });
    if (!res.ok) return null;
    const data = (await res.json()) as { tip_id: string; score: number };
-    return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v1' };
+    return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v2' };
  } catch {
    return null;
  }
@@ -145,6 +147,7 @@ async function fetchLlmCandidates(
  dayOfWeek: number,
  promptVersion: string | null,
  profile: Profile,
+  traceparent?: string,
 ): Promise<LlmGenerateResult> {
  try {
    const tasks = signals.slice(0, 10).map((s) => ({
@@ -155,7 +158,7 @@ async function fetchLlmCandidates(
    }));
    const res = await fetch(`${config.ML_SERVING_URL}/generate`, {
      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
+      headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
      body: JSON.stringify({
        user_id: userId,
        context: { tasks, hour_of_day: hour, day_of_week: dayOfWeek },
@@ -225,6 +228,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
    dayOfWeek,
    requestedPromptVersion,
    profile,
+    req.traceparent,
  );

  const allCandidates: TipCandidate[] = [...signalCandidates, ...llmResult.candidates];
@@ -239,7 +243,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
  const t0 = Date.now();

  // Stage 2: score — egreedy bandit with random fallback
-  const scored = await remotePolicy(req.userId!, allCandidates, profile);
+  const scored = await remotePolicy(req.userId!, allCandidates, profile, req.traceparent);
  const latencyMs = Date.now() - t0;
  const tip = scored
    ? (allCandidates.find((t) => t.id === scored.tipId) ?? randomPolicy(allCandidates))
@@ -371,6 +375,8 @@ async function sendRewardWithRetry(
  tipId: string,
  reward: number,
  features: TipCandidate['features'],
+  profile: Profile,
+  traceparent?: string,
 ): Promise<void> {
  const body = JSON.stringify({
    user_id: userId,
@@ -378,13 +384,14 @@ async function sendRewardWithRetry(
    reward,
    features,
    day_of_week: new Date().getDay(),
+    profile_features: profile,
  });

  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
-      const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy`, {
+      const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy/v2`, {
        method: 'POST',
-        headers: { 'Content-Type': 'application/json' },
+        headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
        body,
        signal: AbortSignal.timeout(3000),
      });
@@ -392,7 +399,7 @@ async function sendRewardWithRetry(
      throw new Error(`HTTP ${res.status}`);
    } catch (err: any) {
      if (attempt === 3) {
-        console.error(`[reward] failed after 3 attempts for tip ${tipId}: ${err.message}`);
+        logger.error({ tipId, err }, 'reward: failed after 3 attempts');
        bus.publish('signals.tip.reward_failed', {
          userId,
          tipId,
@@ -463,7 +470,9 @@ router.post('/tip/:id/feedback', requireAuth, async (req: AuthenticatedRequest,
  });

  if (candidate) {
-    sendRewardWithRetry(req.userId!, tipId, reward, candidate.features);
+    // Re-fetch profile for the v2 ridge update; TTL cache makes this near-instant.
+    const profile = await getProfile(req.userId!);
+    sendRewardWithRetry(req.userId!, tipId, reward, candidate.features, profile, req.traceparent);
  }

  // Delegate action to the owning signal source (e.g. mark done in Todoist)
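
Review note on the `traceparent` forwarding in `recommender.ts`: the hunks above pass the header through unchanged and leave it out entirely when it is absent. A minimal TypeScript sketch of that pattern follows, together with a validator for the W3C Trace Context layout (`version-traceid-parentid-flags`). `parseTraceparent` and `outboundHeaders` are hypothetical names used for illustration, not functions from this repository, and the diff does not show whether the API validates the header at all.

```typescript
// W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

interface TraceContext {
  version: string;
  traceId: string;
  parentId: string;
  flags: string;
}

// Hypothetical helper: returns null for anything that is not a well-formed header.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, version = '', traceId = '', parentId = '', flags = ''] = m;
  // The spec reserves version "ff" and forbids all-zero trace and parent ids.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, flags };
}

// Same conditional-spread pattern as the diff: the key is omitted when no
// trace context is available, rather than being sent with an empty value.
function outboundHeaders(traceparent?: string): Record<string, string> {
  return {
    'Content-Type': 'application/json',
    ...(traceparent ? { traceparent } : {}),
  };
}
```

Because the spread adds nothing when `traceparent` is undefined, callers that have no trace context produce exactly the headers they sent before this change.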
--- a/services/api/src/signals/tests/scheduler.test.ts
+++ b/services/api/src/signals/tests/scheduler.test.ts
@@ -8,6 +8,11 @@
 */
 import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';

+vi.mock('../../logger.js', () => ({
+  logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), fatal: vi.fn() },
+}));
+import { logger } from '../../logger.js';
+
 // ── mock the drizzle query chain: db.select(...).from(...).where(...) ────────
 let users: { userId: string }[] = [];
 const whereMock = vi.fn(async () => users);
@@ -35,6 +40,7 @@ beforeEach(() => {
  whereMock.mockClear();
  fromMock.mockClear();
  selectMock.mockClear();
+  vi.clearAllMocks();
  vi.useFakeTimers();
 });

@@ -102,8 +108,6 @@ describe('startTodoistSyncScheduler', () => {
      if (id === 'bad') throw new Error('todoist 401');
      return [];
    });
-    const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
-    const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {});

    startTodoistSyncScheduler(60_000);
    await vi.advanceTimersByTimeAsync(10_001);
@@ -112,19 +116,27 @@ describe('startTodoistSyncScheduler', () => {
    await Promise.resolve();

    expect(fetchSignalsMock).toHaveBeenCalledTimes(3);
-    expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('sync error'), expect.anything());
-    expect(logSpy).toHaveBeenCalledWith(expect.stringContaining('2 ok, 1 failed'));
+    expect(logger.error).toHaveBeenCalledWith(
+      expect.objectContaining({ err: expect.anything() }),
+      'scheduler: sync error',
+    );
+    expect(logger.info).toHaveBeenCalledWith(
+      expect.objectContaining({ ok: 2, failed: 1 }),
+      'scheduler: todoist sync',
+    );
  });

  it('survives a db query failure — logs and skips the tick', async () => {
    const { startTodoistSyncScheduler } = await import('../scheduler.js');
    whereMock.mockRejectedValueOnce(new Error('sqlite locked'));
-    const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});

    startTodoistSyncScheduler(60_000);
    await vi.advanceTimersByTimeAsync(10_001);

    expect(fetchSignalsMock).not.toHaveBeenCalled();
-    expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('failed to query users'));
+    expect(logger.error).toHaveBeenCalledWith(
+      expect.objectContaining({ err: expect.anything() }),
+      'scheduler: failed to query users',
+    );
  });
 });
--- a/services/api/src/signals/aggregator.ts
+++ b/services/api/src/signals/aggregator.ts
@@ -1,4 +1,5 @@
 import type { Signal, SignalSource } from '@oo/shared-types';
+import { logger } from '../logger.js';

 /**
 * Merges signals from all registered sources for a user.
@@ -24,7 +25,7 @@ export class SignalAggregator {
      if (r.status === 'fulfilled') {
        signals.push(...r.value);
      } else {
-        console.error(`[aggregator] source '${this.sources[i].id}' failed:`, r.reason);
+        logger.error({ sourceId: this.sources[i]!.id, err: r.reason }, 'aggregator: source failed');
      }
    }
    return signals;
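
Review note on the aggregator: the hunk logs one failed source and keeps the signals from the others. A minimal sketch of that fan-out, assuming `Promise.allSettled` is what produces the `r.status` values seen above. The `Signal` shape here is a simplified stand-in for the type in `@oo/shared-types`, and `fetchAll` with an `onError` callback is a hypothetical free-function form of the class method.

```typescript
// Simplified stand-ins for the shared types (assumption, not the real shapes).
interface Signal {
  id: string;
  source: string;
}
interface SignalSource {
  readonly id: string;
  fetchSignals(userId: string): Promise<Signal[]>;
}
type OnSourceError = (sourceId: string, err: unknown) => void;

async function fetchAll(
  sources: readonly SignalSource[],
  userId: string,
  onError: OnSourceError,
): Promise<Signal[]> {
  // Wrapping in Promise.resolve().then(...) turns a synchronous throw inside
  // fetchSignals into a rejection, so allSettled still sees every source.
  const results = await Promise.allSettled(
    sources.map((s) => Promise.resolve().then(() => s.fetchSignals(userId))),
  );
  const signals: Signal[] = [];
  results.forEach((r, i) => {
    if (r.status === 'fulfilled') {
      signals.push(...r.value);
    } else {
      // Results keep the input order, so index i identifies the failed source.
      onError(sources[i]?.id ?? 'unknown', r.reason);
    }
  });
  return signals;
}
```

Passing the error handler in keeps the sketch independent of any particular logger; in the diff that role is played by `logger.error({ sourceId, err }, 'aggregator: source failed')`.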
--- a/services/api/src/signals/scheduler.ts
+++ b/services/api/src/signals/scheduler.ts
@@ -13,6 +13,7 @@ import { db } from '../db/index.js';
 import { integrationTokens } from '../db/schema.js';
 import { eq } from 'drizzle-orm';
 import { todoistSource } from './todoist.js';
+import { logger } from '../logger.js';

 const DEFAULT_INTERVAL_MS = 15 * 60 * 1000;

@@ -25,7 +26,7 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
        .from(integrationTokens)
        .where(eq(integrationTokens.tokenStatus, 'active'));
    } catch (err: any) {
-      console.error(`[scheduler] failed to query users: ${err.message}`);
+      logger.error({ err }, 'scheduler: failed to query users');
      return;
    }

@@ -39,10 +40,10 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
    let failed = 0;
    for (const r of results) {
      if (r.status === 'fulfilled') ok++;
-      else { failed++; console.error(`[scheduler] sync error:`, r.reason); }
+      else { failed++; logger.error({ err: r.reason }, 'scheduler: sync error'); }
    }

-    console.log(`[scheduler] todoist sync: ${ok} ok, ${failed} failed (${users.length} users)`);
+    logger.info({ ok, failed, total: users.length }, 'scheduler: todoist sync');
  }

  // Run once shortly after startup, then on interval
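
Review note on the scheduler tick: the two hunks above cover both failure paths, a failed user query (log and skip the tick) and a failed per-user sync (log, count, continue). The sketch below puts both in one function so the control flow is visible at a glance. `runTick`, `TickLog`, and the injected `loadUserIds` / `syncUser` callbacks are hypothetical; the log messages and field names are copied from the diff.

```typescript
// Minimal structural logger type matching the (fields, message) call shape in the diff.
interface TickLog {
  info(fields: Record<string, unknown>, msg: string): void;
  error(fields: Record<string, unknown>, msg: string): void;
}
interface TickSummary {
  ok: number;
  failed: number;
  total: number;
}

async function runTick(
  loadUserIds: () => Promise<string[]>,
  syncUser: (userId: string) => Promise<unknown>,
  log: TickLog,
): Promise<TickSummary | null> {
  let userIds: string[];
  try {
    userIds = await loadUserIds();
  } catch (err) {
    // Query failure: nothing to sync, so skip this tick entirely.
    log.error({ err }, 'scheduler: failed to query users');
    return null;
  }

  // One rejected sync must not cancel the others.
  const results = await Promise.allSettled(userIds.map((id) => syncUser(id)));
  let ok = 0;
  let failed = 0;
  for (const r of results) {
    if (r.status === 'fulfilled') {
      ok++;
    } else {
      failed++;
      log.error({ err: r.reason }, 'scheduler: sync error');
    }
  }

  const summary: TickSummary = { ok, failed, total: userIds.length };
  log.info({ ...summary }, 'scheduler: todoist sync');
  return summary;
}
```

Logging the counts as fields rather than interpolating them into the message is what lets the updated test assert with `expect.objectContaining({ ok: 2, failed: 1 })`.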
--- a/services/api/src/signals/todoist.ts
+++ b/services/api/src/signals/todoist.ts
@@ -3,6 +3,7 @@ import { db } from '../db/index.js';
 import { integrationTokens } from '../db/schema.js';
 import { eq, and } from 'drizzle-orm';
 import { bus } from '../events/bus.js';
+import { logger } from '../logger.js';

 const CACHE_TTL_MS = 30_000;

@@ -46,7 +47,7 @@ export class TodoistSignalSource implements SignalSource {

    if (!res.ok) {
      if (res.status === 401) {
-        console.error(`[todoist] token expired for user ${userId}`);
+        logger.warn({ userId }, 'todoist: token expired');
        bus.publish('signals.integration.token_expired', {
          userId,
          provider: 'todoist',
@@ -88,7 +89,7 @@ export class TodoistSignalSource implements SignalSource {
    });

    this.cache.set(userId, { signals, fetchedAt: Date.now() });
-    bus.publish('signals.task.synced', { userId, count: signals.length, syncedAt: now });
+    bus.publish('signals.task.synced', { userId, source: 'todoist', count: signals.length, syncedAt: now });

    return signals;
  }
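
Review note on the Todoist cache: the file stores `{ signals, fetchedAt }` per user and declares `CACHE_TTL_MS = 30_000`, but the read path is outside the hunks shown. A minimal sketch of a TTL cache with that entry shape follows. `TtlCache`, the injectable clock, and the choice to treat an entry as expired once its age reaches the TTL are all assumptions made for illustration.

```typescript
interface CacheEntry<V> {
  value: V;
  fetchedAt: number;
}

class TtlCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without real waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, fetchedAt: this.now() });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    // Assumption: an entry whose age has reached the TTL counts as stale.
    if (this.now() - entry.fetchedAt >= this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

With a 30 second TTL, repeated `/recommend` calls for the same user inside that window would be served from memory instead of hitting the upstream API again.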
--- a/services/integrations/README.md
+++ b/services/integrations/README.md
@@ -2,30 +2,49 @@

 Third-party connectors and the token vault.

-## Connector interface
+## Signal source interface
+
+Each connector implements `SignalSource` from `@oo/shared-types`:

 ```ts
-interface Connector {
-  id: string                                // e.g. "todoist"
-  scopes: string[]                          // human-readable list shown in consent UI
-  beginOAuth(user): Promise<{ redirectUrl, state }>
-  finishOAuth(code, state): Promise<StoredCredential>
-  fetchSignals(user, since?): AsyncIterable<NormalizedEvent>
-  // incremental-sync cursor (Todoist sync_token, webhook timestamps, etc.)
-  // stored in Credential.meta; the connector owns its shape.
-  act?(user, action): Promise<void>          // optional write-back (complete task, etc.)
-  revoke(user): Promise<void>                // REQUIRED: provider-side token revocation on disconnect
+interface SignalSource {
+  readonly id: string                                       // e.g. "todoist"
+  fetchSignals(userId: string): Promise<Signal[]>          // returns normalized Signal[]
+  act?(userId: string, signalId: string, action: string): Promise<void>  // optional write-back
 }
 ```

+`SignalAggregator` (`services/api/src/signals/aggregator.ts`) fans out to all registered sources in parallel, isolating per-source failures.
+
 ## Token vault

- Credentials encrypted at rest (libsodium sealed box); key from env/KMS.
- Refresh handled transparently; consumers never see raw tokens.
- One row per `(user, provider)` with provider-specific `meta`.
+OAuth tokens stored in the `integration_tokens` SQLite table (`services/api/src/db/schema.ts`):

-## Roadmap
+| Column | Description |
+|--------|-------------|
+| `userId` | owner |
+| `provider` | e.g. `todoist` |
+| `accessToken` | OAuth access token (plain in dev; encrypted in prod via server secret store) |
+| `tokenStatus` | `active` \| `needs_reconnect` |

- Phase 0: **Todoist** (OAuth2, read tasks, complete task).
- Phase 2: Google Calendar, Apple Health (web import), generic webhook ingress.
- Phase 5: public SDK so third parties can ship connectors.
+On a 401 from the upstream API, the connector marks the token `needs_reconnect` and publishes `signals.integration.token_expired` so the client can prompt re-auth.
+
+## Routes
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET` | `/api/integrations` | List connected integrations for current user |
+| `GET` | `/api/integrations/todoist/connect` | Start Todoist OAuth flow |
+| `GET` | `/api/integrations/todoist/callback` | OAuth callback — exchange code, store token |
+| `DELETE` | `/api/integrations/:provider` | Disconnect + delete token |
+
+## Connectors
+
+| Connector | Status | Signals produced |
+|-----------|--------|-----------------|
+| Todoist | Phase 1 — active | `task` signals (today + overdue); `done` write-back |
+| Google Calendar | Phase 2 — planned | `event` signals |
+
+## Extraction criteria
+
+Extract to its own process when credential blast-radius isolation requires it (e.g. token vault with KMS-backed encryption needs to run in a hardened sidecar) or when connector volume justifies separate scaling.
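
Review note on the token vault section: the README states that a 401 from the upstream API flips the token to `needs_reconnect` and publishes `signals.integration.token_expired`. A minimal sketch of that transition as a pure function follows. `applyUpstreamStatus`, the `TokenRow` shape (limited to the four columns in the README table), and the rule that an already flagged token does not publish again are assumptions; the real event payload may carry more fields than shown.

```typescript
type TokenStatus = 'active' | 'needs_reconnect';

// Mirrors only the columns listed in the README table.
interface TokenRow {
  userId: string;
  provider: string;
  accessToken: string;
  tokenStatus: TokenStatus;
}

type Publish = (subject: string, payload: Record<string, unknown>) => void;

function applyUpstreamStatus(row: TokenRow, httpStatus: number, publish: Publish): TokenRow {
  // Only a 401 changes state; other errors are left for the caller to handle.
  if (httpStatus !== 401) return row;
  // Assumption: avoid re-publishing when the token is already flagged.
  if (row.tokenStatus === 'needs_reconnect') return row;
  publish('signals.integration.token_expired', {
    userId: row.userId,
    provider: row.provider,
  });
  return { ...row, tokenStatus: 'needs_reconnect' };
}
```

Returning a new row instead of mutating the input keeps the database write explicit at the call site.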
--- a/services/recommender/README.md
+++ b/services/recommender/README.md
@@ -1,29 +1,42 @@
 # recommender

-The core of oO. Takes a user + a context, returns **one** tip.
+The core of oO. Takes a user + context, returns **one** tip.

 ## Contract

 ```
-POST /recommend
-  { user_id, context?: { time, timezone, client, ... } }
-  → { tip: { id, kind: "todo"|"advice", title, body, source, deep_link, meta } }
+POST /api/recommend
+  { }  (user inferred from session)
+  → { tip: { id, content, source, kind, sourceId?, rationale?, createdAt } }

-POST /feedback
-  { user_id, tip_id, reaction: "done"|"snooze"|"dismiss", at }
+POST /api/tip/:id/feedback
+  { action: "done"|"dismiss"|"snooze"|"helpful"|"not_helpful", dwellMs? }
+  → { ok: true }
 ```

-## Internals (stable seams)
+## Pipeline

- **Candidate sources** — pluggable async generators. v0: Todoist tasks via `integrations`. Later: advice library, calendar nudges, health prompts.
- **Feature assembler** — fills the `context` blob (inline in Phase 0; calls feature store from M1). Never inlined into policy code.
- **Policy registry** — `Policy.pick(candidates, context) → tip`. Named entries:
-  - `random` — v0 (Phase 0).
-  - `bandit.linucb.pooled` — v1 (Phase 1). **Global-then-personalize**: pooled features shared across users; per-user residual once data allows.
-  - `remote` — delegates to `ml/serving` FastAPI scorer (Phase 1+).
- **Shadow hook** — every request optionally runs N shadow policies in parallel and logs their picks + estimated rewards. Promotion from shadow → A/B → launch is a separate, deliberate step (ADR-0002).
- **TipInstance persistence** — every decision writes `context_snapshot` (features seen at decision time). This is what makes offline replay honest.
+1. **Signals** — `SignalAggregator.fetchAll(userId)` fans out to all registered `SignalSource` implementations in parallel. Currently: `TodoistSignalSource`. Add a source via `aggregator.register(new MySource())`.
+2. **LLM candidates** — `POST /generate` on `ml/serving` returns `TipCandidate[]` from the `tip-generator` LiteLLM alias.
+3. **Scoring** — all candidates are sent to the active policy on `ml/serving` (`POST /score/egreedy/v2`). Falls back to random if `ml/serving` is unreachable.
+4. **Shadow policies** — the active policy runs shadow policies in the same request for offline comparison (ADR-0002). Currently: `egreedy-v1` shadows `egreedy-v2`.
+5. **Persistence** — `tipViews` + `tipScores` rows written on every serve; `tipFeedback` row on reaction.
+6. **Reward delivery** — a reaction triggers `POST /reward/egreedy/v2` on `ml/serving` with the inferred reward value and the user's `profile_features`.

-## Phase 0 goal
+## Signal normalization

-`RandomPolicy` only. The service, contract, registry, shadow hook, and tip-instance persistence all exist; no ML yet.
+Signals carry `features: Record<string, number | boolean>` (bandit-ready) and `metadata: Record<string, unknown>` (source-specific raw fields). The bandit treats features as an opaque dict — sources own their feature names. See ADR-0009.
+
+## Policy registry
+
+| Policy | Status | Notes |
+|--------|--------|-------|
+| `random` | Fallback | Used when ml/serving is unreachable |
+| `egreedy-v1` | Shadow | d=7, ADR-0007 |
+| `egreedy-v2` | **Active** | d=12 + profile features, ADR-0012 |
+
+Shadow → active promotion requires offline sim + online agreement (ADR-0002).
+
+## Extraction criteria
+
+Extract to its own process once it becomes a scaling hotspot: when `POST /api/recommend` p99 latency exceeds the SLA, or when recommendation CPU displaces API serving on a shared host.
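
Review note on the scoring step: the README names the policy family (`egreedy`) and the random fallback, and `recommender.ts` resolves the scored tip id back to a candidate with `find(...) ?? randomPolicy(...)`. A minimal sketch of both pieces follows, assuming the textbook epsilon-greedy rule (explore uniformly with probability epsilon, otherwise take the highest score). The real policy runs in `ml/serving` and is not part of this diff; `epsilonGreedy`, `resolveTip`, `randomPick`, and the `score` field are hypothetical.

```typescript
interface Candidate {
  id: string;
}
interface ScoredCandidate extends Candidate {
  score: number;
}
// Assumption: rng returns a value in [0, 1), like Math.random.
type Rng = () => number;

function randomPick<T>(items: readonly T[], rng: Rng = Math.random): T {
  if (items.length === 0) throw new Error('no candidates');
  return items[Math.floor(rng() * items.length)]!;
}

// Explore with probability epsilon, otherwise exploit the best score.
function epsilonGreedy(
  items: readonly ScoredCandidate[],
  epsilon: number,
  rng: Rng = Math.random,
): ScoredCandidate {
  if (items.length === 0) throw new Error('no candidates');
  if (rng() < epsilon) return randomPick(items, rng);
  return items.reduce((best, c) => (c.score > best.score ? c : best));
}

// Map the remote policy's answer back to a candidate; fall back to a random
// pick when the scorer was unreachable (null) or returned an unknown id.
function resolveTip<T extends Candidate>(
  candidates: readonly T[],
  scoredTipId: string | null,
  rng: Rng = Math.random,
): T {
  const hit = scoredTipId === null ? undefined : candidates.find((c) => c.id === scoredTipId);
  return hit ?? randomPick(candidates, rng);
}
```

Injecting the random source makes both branches deterministic under test, which is useful when comparing an active policy against a shadow in offline runs.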
Author	SHA1	Message	Date
alvis	e40dfdcbb0	chore(infra): wire MLflow/Airflow env vars, fix healthcheck, add .dockerignore - docker-compose: pass ML_SERVING_URL, MLFLOW_URL, AIRFLOW_URL + creds to api service - docker-compose: pass NEXT_PUBLIC_MLFLOW_URL/AIRFLOW_URL to admin service - docker-compose: replace wget healthcheck with node fetch (wget not in node image) - docker-compose: enable Airflow basic_auth API backend; add MLflow pip dep for DAGs - Dockerfiles: tighten layer caching, add .dockerignore Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 12:08:43 +00:00
alvis	bad1bb2cba	feat(simulate): MLflow tracking, Airflow DAG integration, health checks for mlflow/airflow - sim_runs schema: add judge_mode, n_policies, airflow_dag_run_id, mlflow_run_id columns - admin health endpoint: add mlflow + airflow checks (Basic auth for Airflow API) - admin nav: add Simulations page link; rename section label - runner.py: optional MLflow experiment tracking; multi-policy support - sim_dag.py: Airflow DAG for offline sim pipeline - admin simulate page + API client methods for sim runs - shared-types tsconfig: exclude test files from build Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 12:08:36 +00:00
alvis	e96ceb7ee1	feat(auth): token-based admin authentication for Playwright/CI (#105) Add POST /api/auth/token — validates ADMIN_TOKEN env var, creates a 24h session and sets the sid cookie so automated tools can access the admin panel without Google OAuth. Admin login page gains a token input form. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 12:07:43 +00:00
alvis	b554970032	docs(observability): add services/api README; update ml/serving + recommender docs (#18) - services/api/README.md: new — contract, middleware stack, background tasks, config table (LOG_LEVEL, SENTRY_DSN), health story, extraction criteria - ml/serving/README.md: add Observability section (structlog JSON, traceparent → trace_id binding), add SENTRY_DSN + ENV to config table - services/recommender/README.md: fix policy table — egreedy-v2 is active (#99), egreedy-v1 is shadow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 03:41:39 +00:00
alvis	c4960d0601	feat(observability): structured logs, W3C trace IDs, Sentry hooks (#18) - TS: pino + pino-http; every HTTP request log includes traceId from W3C traceparent header (generated if absent); forwarded to ml/serving on all /score, /generate, /reward, and /api/ml proxy calls - Python: structlog JSON; FastAPI middleware binds trace_id via contextvars so every log line within a request carries it - Sentry: optional SENTRY_DSN init in both runtimes (no-op if unset) - Replace all console.* calls across services/api with pino logger - Update tests to spy on logger instead of console Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 03:37:28 +00:00
alvis	7281af83a4	feat(bandit): promote egreedy-v2 (D=12, profile features) as active policy (#99) Offline sim gate passed — egreedy-v2 mean reward −0.629 vs egreedy-v1 −0.642 (5 users × 20 rounds, rule judge, seed 42). v2 wins 3/5 personas. - recommender.ts: switch remotePolicy() to /score/egreedy/v2 - recommender.ts: switch sendRewardWithRetry() to /reward/egreedy/v2 with profile_features payload so the ridge update uses the full D=12 vector - recommender.ts: re-fetch profile at feedback time (TTL-cached, near-instant) - ADR-0012: status Accepted → Promoted, promotion record appended Shadow entry egreedy-v2-shadow kept in registry (active: false) for rollback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 03:08:28 +00:00
alvis	cba3f1a184	docs(services): update integrations + recommender READMEs for signal abstraction (#78) integrations/README — replace stale Connector interface and fictional libsodium vault with the actual SignalSource pattern, SQLite token table, and real OAuth routes. recommender/README — document the SignalAggregator pipeline, current policy registry, and actual /recommend + /feedback contract shapes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 17:17:38 +00:00
alvis	352469162d	fix(signals): add missing source field to TaskSyncedEvent (#78) TaskSyncedPayload in shared-types and ml/serving schemas both require source, but TaskSyncedEvent in bus.ts and the todoist publish call both omitted it — causing the JetStream consumer to nak every task.synced message on validation failure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 17:15:32 +00:00
alvis	45416000f9	feat(features): per-feature freshness spec — JIT vs batched (#61) Each ml/features/*.py now declares freshness, source, and fallback per feature. ProfileFeature gains ttl_sec (mirrored from registry.ts), freshness="batched", source, and fallback. context.py adds ContextFeatureSpec + CONTEXT_FEATURES for the three JIT features (hour_of_day, day_of_week, tasks). CI test parses ttlSec from registry.ts to catch drift. ml/README updated with split JIT/batched feature contract. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 17:02:55 +00:00
alvis	bd3ea1b8b1	docs(schema): update docs for #54 — proto registry + buf CI gate - packages/shared-types/README.md: new — documents HTTP vs event surfaces, proto file layout, schema evolution rules, and how to run buf locally - ml/serving/README.md: note pydantic payload validation in consumer section - CLAUDE.md: replace "schema registry enforced when #54 lands" with the actual state; remove #54 from active-work list Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:53:20 +00:00
alvis	377373a95d	test(schema): unit tests for schemas.py and nats_consumer._handle (#54) 17 tests covering: pydantic model validation (all payload types, optional fields, invalid enum values, missing required fields), _handle write path for task_synced, validation errors surfaced through _make_handler causing nak instead of ack. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:51:15 +00:00
alvis	d539fde0c1	feat(schema): protobuf event registry + buf CI gate (#54) - Add proto schemas in packages/shared-types/events/ (oo.events.v1): envelope.proto, signals.proto, integration.proto - buf.yaml with STANDARD lint + FILE breaking-change rules - .gitea/workflows/buf-check.yaml: lint + breaking check on every PR touching events/ (needs a Gitea Actions runner to execute) - scripts/buf-check.sh: local equivalent of the CI check - NormalizedEvent TS envelope gains eventId, schemaVersion, producer to align with the proto Envelope message - ml/serving/schemas.py: pydantic models mirroring the v1 proto types - nats_consumer.py: validate payloads via pydantic instead of raw .get() A field-rename PR will now fail buf breaking with exit code 100 and show the offending messages. To make a breaking change: keep the old field reserved, add the new one, bump schema_version to v2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:48:24 +00:00
alvis	f48b5a7646	docs(ml): serving README + update ml/README and CLAUDE.md for #98 - ml/serving/README.md: new — contract, JetStream consumer docs, config, health story, extraction criteria, state file reference - ml/README.md: note JetStream consumers in serving/ row - CLAUDE.md: update active work to reflect #98 shipped, #99 still pending Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 10:21:40 +00:00
alvis	4652e4b582	feat(ml): JetStream durable consumers in ml/serving (#98) Adds a NATS JetStream consumer to ml/serving so the feature pipeline can react to events without the API triggering every read. - nats_consumer.py: durable push consumers for signals.> and feedback.> streams; acks on success, naks for redeliver, up to NATS_MAX_DELIVER attempts; per-consumer health state (last_msg_ts, processed, errors) - main.py: FastAPI lifespan wires start/stop; /health exposes nats state - requirements.txt: adds nats-py>=2.9.0 - Dockerfile.ml: copy all *.py from ml/serving (was missing prompts.py) Handled subjects: signals.task.synced → writes per-user sync metadata to STATE_DIR signals.tip.feedback → logged for observability (reward via HTTP path) Config: NATS_URL (empty = disabled), NATS_DURABLE_PREFIX, NATS_MAX_DELIVER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 10:19:47 +00:00