chore(infra): wire MLflow/Airflow env vars, fix healthcheck, add .dockerignore

- docker-compose: pass ML_SERVING_URL, MLFLOW_URL, AIRFLOW_URL + creds to api service
- docker-compose: pass NEXT_PUBLIC_MLFLOW_URL/AIRFLOW_URL to admin service
- docker-compose: replace wget healthcheck with node fetch (wget not in node image)
- docker-compose: enable Airflow basic_auth API backend; add MLflow pip dep for DAGs
- Dockerfiles: tighten layer caching, add .dockerignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(simulate): MLflow tracking, Airflow DAG integration, health checks for mlflow/airflow
2026-04-26 12:08:43 +00:00 · 2026-04-26 12:08:36 +00:00 · 2026-04-26 12:07:43 +00:00 · 2026-04-26 03:41:39 +00:00 · 2026-04-26 03:37:28 +00:00 · 2026-04-26 03:08:28 +00:00
62 changed files with 3198 additions and 278 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -0,0 +1,19 @@
 **/node_modules
 **/.next
 **/dist
 **/coverage
 **/.vitest-cache
 **/.turbo
 .git
 .gitea
 .github
 .vscode
 .idea
 **/.env
 **/.env.local
 **/*.log
 docs
 infra/docker/data
 **/__tests__
 **/*.test.ts
 **/*.test.tsx
--- a/.env.example
+++ b/.env.example
@@ -10,6 +10,32 @@ API_BASE_URL=http://localhost:3078
 WEB_BASE_URL=http://localhost:3000
 ML_SERVING_URL=http://localhost:8000
 # MLflow (mlops profile) — http://localhost:5000/mlflow in dev, https://o.alogins.net/mlflow in prod.
 # MLFLOW_ADMIN_PASSWORD seeds the admin account on first boot (changing it after first run
 # requires the MLflow UI or API — see infra/mlflow/basic_auth.ini).
 MLFLOW_URL=http://localhost:5000
 MLFLOW_ADMIN_PASSWORD=change-me
 # Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
 NEXT_PUBLIC_MLFLOW_URL=http://localhost:5000
 # Airflow (mlops profile) — http://localhost:8080/airflow in dev.
 # Start with: docker compose --profile full --profile mlops up
 AIRFLOW_URL=http://localhost:8080
 AIRFLOW_ADMIN_PASSWORD=change-me
 AIRFLOW_DB_PASSWORD=airflow
 AIRFLOW_SECRET_KEY=change-me-in-prod
 AIRFLOW_FERNET_KEY=
 AIRFLOW_BASE_URL=https://o.alogins.net/airflow
 # Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
 NEXT_PUBLIC_AIRFLOW_URL=http://localhost:8080
 # Shared secret for Airflow→API internal callbacks. Generate: openssl rand -hex 32
 INTERNAL_API_TOKEN=
 # Static token for automated/service access to the admin panel (e.g. Playwright tests).
 # Leave empty to disable token-based login. Generate: openssl rand -hex 32
 ADMIN_TOKEN=
 # AI stack — shared Agap services (ollama + litellm + langfuse). Not run from oO.
 # Prod: https://llm.alogins.net  |  Dev: http://host.docker.internal:4000 from containers,
 # http://localhost:4000 from host. Ollama: http://host.docker.internal:11434 / :11434.
--- a/.gitea/workflows/buf-check.yaml
+++ b/.gitea/workflows/buf-check.yaml
@@ -0,0 +1,37 @@
 name: buf-check
 on:
  push:
    branches: [main]
    paths:
      - 'packages/shared-types/events/**'
  pull_request:
    paths:
      - 'packages/shared-types/events/**'
 jobs:
  buf:
    name: Lint & breaking-change check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install buf
        run: |
          BUF_VERSION=1.50.0
          curl -sSfL \
            "https://github.com/bufbuild/buf/releases/download/v${BUF_VERSION}/buf-Linux-x86_64" \
            -o /usr/local/bin/buf
          chmod +x /usr/local/bin/buf
          buf --version
      - name: buf lint
        run: buf lint packages/shared-types/events
      - name: buf breaking
        if: github.event_name == 'pull_request'
        run: |
          buf breaking packages/shared-types/events \
            --against ".git#branch=${{ github.base_ref }},subdir=packages/shared-types/events"
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -56,7 +56,7 @@ docs/              architecture notes, ADRs, API specs
 ## Contracts between modules
 - **HTTP** (OpenAPI, in `packages/shared-types/http/`) — synchronous request/response. In-process today; over the network once extracted. Signatures are identical.
- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Schema registry enforced in CI when #54 lands; until then payloads are JSON envelopes (ADR-0005).
+- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with an `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Proto schemas (ADR-0005) live in `packages/shared-types/events/oo/events/v1/`; `buf lint` + `buf breaking` run in CI on every PR touching those files (`.gitea/workflows/buf-check.yaml`).
 - Do not redefine types per module. Regenerate from `shared-types`.
 ## Conventions
@@ -100,7 +100,7 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i
 **M1 shipped. M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`.
-Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
+Active work: bandit promotion (#99 — offline sim + ADR-0012 pending) and M2 issues (#61 freshness SLAs, #78 signal abstraction, #93 model benchmark).
 ## What NOT to do
@@ -112,3 +112,13 @@ Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
 - Don't call LLMs directly from application code. All LLM calls go through `ml/serving` (Python) via `LITELLM_URL`. The TS recommender never holds a model name.
 - Don't embed MLflow/Airflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `/airflow`, `ai.alogins.net`.
 - Don't `nats.publish()` directly from feature code. All publishes go through the in-process `Bus` (`services/api/src/events/bus.ts`); the NATS adapter (`events/nats.ts`) bridges every publish to JetStream when `NATS_URL` is set. This keeps subscribers, the ring-buffer tail used by the admin event viewer, and JetStream all in lockstep.
 ## Admin app
 `apps/admin` rewrites `/api/*` → `$NEXT_PUBLIC_API_URL/api/*` via `next.config.ts`. So `apiFetch('/admin/stats')` in `apps/admin/src/lib/api.ts` hits the Express backend, not a Next.js route.
 Running `tsc --noEmit -p apps/admin/tsconfig.json` always reports `Cannot find module 'next'` errors — expected outside the Next.js build context; use `next build` for real type errors.
 ## Auth / session pattern
 Sessions use an `sid` cookie. Admin routes stack `requireAuth` (sets `req.userId`) then `requireAdmin` (checks `role = 'admin'` in DB). Token-based admin auth: `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var sets the `sid` cookie — used by Playwright and CI.
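
The `requireAuth` then `requireAdmin` stacking described above can be sketched without any framework. This is a minimal illustration, not the project's middleware: the in-memory maps, the `runChain` helper, and the 401/403 status codes are all assumptions made for the example.

```typescript
// Framework-free sketch of the requireAuth -> requireAdmin stack.
// The maps below are hypothetical stand-ins for the session store and users table.
type Req = { cookies: Record<string, string | undefined>; userId?: string };
type Res = { status: number };
type Middleware = (req: Req, next: () => Res) => Res;

const sessions = new Map<string, string>([
  ['sid-admin', 'u1'],
  ['sid-user', 'u2'],
]);
const roles = new Map<string, string>([
  ['u1', 'admin'],
  ['u2', 'user'],
]);

// Resolves the sid cookie to a user id; 401 is an assumed status code.
const requireAuth: Middleware = (req, next) => {
  const sid = req.cookies.sid;
  const userId = sid ? sessions.get(sid) : undefined;
  if (!userId) return { status: 401 };
  req.userId = userId;
  return next();
};

// Relies on requireAuth having set req.userId; 403 is an assumed status code.
const requireAdmin: Middleware = (req, next) => {
  const role = req.userId ? roles.get(req.userId) : undefined;
  return role === 'admin' ? next() : { status: 403 };
};

// Runs the middlewares in order, then the route handler.
function runChain(req: Req, chain: Middleware[], handler: () => Res): Res {
  const step = (i: number): Res => {
    const mw = chain[i];
    return mw ? mw(req, () => step(i + 1)) : handler();
  };
  return step(0);
}

const adminStack: Middleware[] = [requireAuth, requireAdmin];
const adminHandler = (): Res => ({ status: 200 });
```

Order matters in this sketch: `requireAdmin` never looks at the cookie, so it only works after `requireAuth` has populated `req.userId`.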
--- a/apps/admin/README.md
+++ b/apps/admin/README.md
@@ -8,6 +8,15 @@ Next.js 15 app. Deployed at `admin.o.alogins.net` (dev: `http://localhost:3080`)
  and checks `role === 'admin'`. First admin is seeded via `ADMIN_SEED_EMAIL` env var at API startup.
 - Admin write actions are appended to the `admin_actions` audit log in the DB.
 ## Authentication
 Two ways to sign in:
 | Method | How |
 |--------|-----|
 | Google OAuth | Click "Sign in with Google" on the login page |
 | Token | `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var; sets `sid` cookie valid for 24 h. Used by Playwright tests and CI automation. |
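
A rough sketch of the token path in the table above, under stated assumptions: `tokenMatches` and `sidExpiry` are hypothetical helper names, and the length-checked XOR comparison is one possible way to compare secrets, not necessarily what the API does. The empty-token rule follows `.env.example` (an empty `ADMIN_TOKEN` disables token login), and the 24 h lifetime comes from the table.

```typescript
// Hypothetical helper: returns true only when token login is enabled
// and the provided token equals the configured ADMIN_TOKEN.
function tokenMatches(provided: string, configured: string): boolean {
  if (configured.length === 0) return false; // empty ADMIN_TOKEN disables token login
  if (provided.length !== configured.length) return false;
  let diff = 0;
  // Accumulate differences instead of returning at the first mismatch.
  for (let i = 0; i < configured.length; i++) {
    diff |= provided.charCodeAt(i) ^ configured.charCodeAt(i);
  }
  return diff === 0;
}

// The sid cookie is described as valid for 24 h.
const SID_TTL_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: expiry timestamp for a cookie issued at issuedAtMs.
function sidExpiry(issuedAtMs: number): Date {
  return new Date(issuedAtMs + SID_TTL_MS);
}
```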
 ## Pages
 | Route | Description |
--- a/apps/admin/src/app/login/page.tsx
+++ b/apps/admin/src/app/login/page.tsx
@@ -1,15 +1,67 @@
 'use client';
 import { useState } from 'react';
 import { useRouter } from 'next/navigation';
 export default function LoginPage() {
  const router = useRouter();
  const [token, setToken] = useState('');
  const [error, setError] = useState('');
  const [loading, setLoading] = useState(false);
  async function handleTokenLogin(e: React.FormEvent) {
    e.preventDefault();
    setError('');
    setLoading(true);
    try {
      const res = await fetch('/api/auth/token', {
        method: 'POST',
        credentials: 'include',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ token }),
      });
      if (!res.ok) {
        const data = await res.json().catch(() => ({}));
        setError((data as { error?: string }).error ?? 'Invalid token');
        return;
      }
      router.push('/');
    } catch {
      setError('Request failed');
    } finally {
      setLoading(false);
    }
  }
  return (
    <div className="flex min-h-screen items-center justify-center">
-      <div className="text-center space-y-4">
+      <div className="text-center space-y-6 w-72">
        <h1 className="text-2xl font-semibold">oO Admin</h1>
-        <p className="text-gray-400 text-sm">Sign in via the main app first, then return here.</p>
+
        <a
          href="/sign-in"
          className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors"
        >
          Sign in with Google
        </a>
        <form onSubmit={handleTokenLogin} className="space-y-3">
          <input
            type="password"
            placeholder="Admin token"
            value={token}
            onChange={(e) => setToken(e.target.value)}
            className="w-full px-3 py-2 bg-gray-900 border border-gray-700 rounded text-sm focus:outline-none focus:border-gray-500"
          />
          {error && <p className="text-red-400 text-xs">{error}</p>}
          <button
            type="submit"
            disabled={loading || !token}
            className="w-full px-4 py-2 bg-gray-700 text-white rounded text-sm font-medium hover:bg-gray-600 disabled:opacity-40 transition-colors"
          >
            {loading ? 'Signing in…' : 'Sign in with token'}
          </button>
        </form>
      </div>
    </div>
  );
--- a/apps/admin/src/app/simulate/page.tsx
+++ b/apps/admin/src/app/simulate/page.tsx
@@ -0,0 +1,220 @@
 'use client';
 import { useEffect, useState } from 'react';
 import { AdminShell } from '@/components/AdminShell';
 import {
  startSimulation,
  getSimulationRuns,
  getSimulationRun,
  SimRun,
 } from '@/lib/api';
 const POLICIES = ['linucb-v1', 'egreedy-v1', 'egreedy-v2'];
 const mlflowBase = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
 const airflowBase = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
 function mlflowRunUrl(runId: string) {
  return `${mlflowBase}/#/experiments/1/runs/${runId}`;
 }
 function airflowRunUrl(dagRunId: string) {
  return `${airflowBase}/dags/bandit_sim/grid?dag_run_id=${encodeURIComponent(dagRunId)}`;
 }
 function StatusBadge({ status }: { status: string }) {
  const cls: Record<string, string> = {
    running: 'bg-blue-900 text-blue-300 border-blue-800',
    done:    'bg-green-900 text-green-300 border-green-800',
    failed:  'bg-red-900 text-red-300 border-red-800',
    pending: 'bg-gray-800 text-gray-400 border-gray-700',
  };
  return (
    <span className={`text-xs px-2 py-0.5 rounded border ${cls[status] ?? cls.pending}`}>
      {status}
    </span>
  );
 }
 function SummaryRow({ run }: { run: SimRun }) {
  const summary = run.summaryJson ? JSON.parse(run.summaryJson) as Record<string, { total_reward: number; mean_reward: number; n_pulls: number }> : null;
  return (
    <div className="bg-gray-900 border border-gray-800 rounded p-4 space-y-2">
      <div className="flex items-center justify-between">
        <div className="space-y-0.5">
          <div className="flex items-center gap-2">
            <span className="font-mono text-xs text-gray-500">{run.id}</span>
            <StatusBadge status={run.status} />
            {run.winner && <span className="text-xs text-indigo-400">winner: {run.winner}</span>}
          </div>
          <div className="text-xs text-gray-600">
            {run.nUsers}u × {run.nRounds}r × {run.tasksPerRound}t/r — {run.judgeMode} judge
            {' · '}{new Date(run.createdAt).toLocaleString()}
          </div>
        </div>
        <div className="flex items-center gap-2 flex-shrink-0">
          {run.mlflowRunId && (
            <a href={mlflowRunUrl(run.mlflowRunId)} target="_blank" rel="noreferrer"
               className="text-xs text-indigo-400 hover:underline">MLflow ↗</a>
          )}
          {run.airflowDagRunId && (
            <a href={airflowRunUrl(run.airflowDagRunId)} target="_blank" rel="noreferrer"
               className="text-xs text-indigo-400 hover:underline">Airflow ↗</a>
          )}
        </div>
      </div>
      {summary && (
        <div className="grid grid-cols-2 gap-2 pt-1 lg:grid-cols-3">
          {Object.entries(summary).map(([policy, s]) => (
            <div key={policy} className={`rounded border p-2 text-xs ${policy === run.winner ? 'border-indigo-700 bg-indigo-950' : 'border-gray-800'}`}>
              <div className="font-mono font-medium text-gray-300 mb-1">{policy}</div>
              <div className="text-gray-500 space-y-0.5">
                <div>total <span className="text-gray-300">{s.total_reward.toFixed(2)}</span></div>
                <div>mean <span className="text-gray-300">{s.mean_reward.toFixed(4)}</span></div>
                <div>pulls <span className="text-gray-300">{s.n_pulls}</span></div>
              </div>
            </div>
          ))}
        </div>
      )}
    </div>
  );
 }
 export default function SimulatePage() {
  const [runs, setRuns] = useState<SimRun[]>([]);
  const [loading, setLoading] = useState(true);
  const [launching, setLaunching] = useState(false);
  const [error, setError] = useState('');
  const [msg, setMsg] = useState('');
  const [nUsers, setNUsers]             = useState(5);
  const [nRounds, setNRounds]           = useState(20);
  const [tasksPerRound, setTasksPerRound] = useState(8);
  const [judgeMode, setJudgeMode]       = useState<'rule' | 'llm'>('rule');
  const [selectedPolicies, setSelectedPolicies] = useState<string[]>(['linucb-v1', 'egreedy-v1']);
  const refresh = () =>
    getSimulationRuns()
      .then((r) => setRuns(r.runs))
      .catch((e) => setError(e.message))
      .finally(() => setLoading(false));
  useEffect(() => {
    refresh();
    const t = setInterval(refresh, 8_000);
    return () => clearInterval(t);
  }, []);
  const togglePolicy = (p: string) =>
    setSelectedPolicies((prev) =>
      prev.includes(p) ? prev.filter((x) => x !== p) : [...prev, p],
    );
  const handleLaunch = async () => {
    if (selectedPolicies.length < 2) { setError('Select at least 2 policies.'); return; }
    setLaunching(true); setError(''); setMsg('');
    try {
      const r = await startSimulation({ nUsers, nRounds, tasksPerRound, judgeMode, policies: selectedPolicies });
      setMsg(r.airflow_dag_run_id
        ? `Launched via Airflow — dag_run_id: ${r.airflow_dag_run_id}`
        : `Launched locally — run id: ${r.id}`);
      await refresh();
    } catch (e: unknown) {
      setError((e as Error).message);
    } finally {
      setLaunching(false);
    }
  };
  return (
    <AdminShell>
      <div className="space-y-8 max-w-4xl">
        <h1 className="text-xl font-semibold">Simulations</h1>
        {error && <p className="text-red-400 text-sm">{error}</p>}
        {msg   && <p className="text-green-400 text-sm">{msg}</p>}
        {/* Launch form */}
        <section className="bg-gray-900 border border-gray-800 rounded p-5 space-y-4">
          <h2 className="text-base font-medium text-gray-300">New simulation</h2>
          <div className="grid grid-cols-3 gap-4 text-sm">
            <label className="space-y-1">
              <span className="text-gray-500">Users</span>
              <input type="number" min={1} max={50} value={nUsers}
                onChange={(e) => setNUsers(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
            <label className="space-y-1">
              <span className="text-gray-500">Rounds</span>
              <input type="number" min={1} max={200} value={nRounds}
                onChange={(e) => setNRounds(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
            <label className="space-y-1">
              <span className="text-gray-500">Tasks/round</span>
              <input type="number" min={1} max={20} value={tasksPerRound}
                onChange={(e) => setTasksPerRound(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
          </div>
          <div className="space-y-1 text-sm">
            <span className="text-gray-500">Policies (select ≥ 2)</span>
            <div className="flex gap-2 flex-wrap pt-1">
              {POLICIES.map((p) => (
                <button key={p} onClick={() => togglePolicy(p)}
                  className={`px-3 py-1 rounded border text-xs font-mono ${
                    selectedPolicies.includes(p)
                      ? 'bg-indigo-900 border-indigo-700 text-indigo-200'
                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
                  }`}>
                  {p}
                </button>
              ))}
            </div>
          </div>
          <div className="space-y-1 text-sm">
            <span className="text-gray-500">Judge</span>
            <div className="flex gap-2 pt-1">
              {(['rule', 'llm'] as const).map((m) => (
                <button key={m} onClick={() => setJudgeMode(m)}
                  className={`px-3 py-1 rounded border text-xs ${
                    judgeMode === m
                      ? 'bg-gray-700 border-gray-500 text-white'
                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
                  }`}>
                  {m}
                </button>
              ))}
            </div>
            {judgeMode === 'llm' && (
              <p className="text-xs text-yellow-600 mt-1">LLM judge requires ANTHROPIC_API_KEY in ml/serving env.</p>
            )}
          </div>
          <button onClick={handleLaunch} disabled={launching}
            className="bg-indigo-600 hover:bg-indigo-500 disabled:opacity-50 text-white rounded px-4 py-2 text-sm">
            {launching ? 'Launching…' : 'Launch simulation'}
          </button>
          <p className="text-xs text-gray-600">
            Runs via <a href={airflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">Airflow</a> (mlops profile) when available; falls back to local subprocess.
            Results logged to <a href={mlflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">MLflow</a>.
          </p>
        </section>
        {/* Run history */}
        <section className="space-y-3">
          <h2 className="text-base font-medium text-gray-300">
            Run history
            {loading && <span className="text-xs text-gray-600 ml-2">loading…</span>}
          </h2>
          {runs.length === 0 && !loading && (
            <p className="text-gray-600 text-sm">No simulations yet.</p>
          )}
          {runs.map((r) => <SummaryRow key={r.id} run={r} />)}
        </section>
      </div>
    </AdminShell>
  );
 }
--- a/apps/admin/src/components/AdminShell.tsx
+++ b/apps/admin/src/components/AdminShell.tsx
@@ -2,6 +2,7 @@
 import Link from 'next/link';
 import { usePathname } from 'next/navigation';
 import { useEffect, useState } from 'react';
 const mlflowUrl  = process.env.NEXT_PUBLIC_MLFLOW_URL  ?? '/mlflow';
 const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
@@ -10,6 +11,7 @@ type NavItem = {
  href: string;
  label: string;
  external?: boolean;
  svcName?: string; // key in the health services map
 };
 type NavSection = {
@@ -31,10 +33,11 @@ const NAV: NavSection[] = [
    ],
  },
  {
-    label: 'Recommender status',
+    label: 'Recommender',
    items: [
      { href: '/tips',             label: 'Tips' },
      { href: '/reward-analytics', label: 'Rewards' },
      { href: '/simulate',         label: 'Simulations' },
    ],
  },
  {
@@ -50,14 +53,33 @@ const NAV: NavSection[] = [
    label: 'Resources',
    items: [
      { href: '/docs',     label: 'Docs' },
-      { href: mlflowUrl, label: 'MLflow ↗', external: true },
+      { href: mlflowUrl,  label: 'MLflow ↗',  external: true, svcName: 'mlflow' },
-      { href: airflowUrl, label: 'Airflow ↗', external: true },
+      { href: airflowUrl, label: 'Airflow ↗', external: true, svcName: 'airflow' },
    ],
  },
 ];
 const STATUS_DOT: Record<string, string> = {
  ok:       'bg-green-500',
  degraded: 'bg-yellow-400',
  down:     'bg-red-500',
 };
 export function AdminShell({ children }: { children: React.ReactNode }) {
  const pathname = usePathname();
  const [svcStatus, setSvcStatus] = useState<Record<string, string>>({});
  useEffect(() => {
    fetch('/api/admin/health', { credentials: 'include' })
      .then((r) => r.json())
      .then((data: { services?: { name: string; status: string }[] }) => {
        const map: Record<string, string> = {};
        for (const s of data.services ?? []) map[s.name] = s.status;
        setSvcStatus(map);
      })
      .catch(() => {});
  }, []);
  return (
    <div className="flex min-h-screen">
      {/* Sidebar */}
@@ -83,13 +105,19 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
                  const active =
                    !item.external &&
                    (item.href === '/' ? pathname === '/' : pathname.startsWith(item.href));
-                  const className = `flex items-center px-3 py-2 rounded text-sm transition-colors ${
+                  const className = `flex items-center gap-2 px-3 py-2 rounded text-sm transition-colors ${
                    active
                      ? 'bg-gray-800 text-white font-medium'
                      : item.external
                        ? 'text-gray-500 hover:text-white hover:bg-gray-900'
                        : 'text-gray-400 hover:text-white hover:bg-gray-900'
                  }`;
                  const dot = item.svcName
                    ? svcStatus[item.svcName]
                      ? <span className={`inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 ${STATUS_DOT[svcStatus[item.svcName]] ?? STATUS_DOT.down}`} />
                      : <span className="inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 bg-gray-700" />
                    : null;
                  return item.external ? (
                    <a
                      key={item.href}
@@ -98,6 +126,7 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
                      rel="noreferrer"
                      className={className}
                    >
                      {dot}
                      {item.label}
                    </a>
                  ) : (
--- a/apps/admin/src/lib/api.ts
+++ b/apps/admin/src/lib/api.ts
@@ -262,3 +262,49 @@ export function saveQuery(name: string, querySql: string) {
 export function deleteSavedQuery(id: string) {
  return apiFetch<{ ok: boolean }>(`/admin/saved-queries/${id}`, { method: 'DELETE' });
 }
 // ── Simulations ────────────────────────────────────────────────────────────
 export interface SimRun {
  id: string;
  policyA: string;
  policyB: string;
  nUsers: number;
  nRounds: number;
  tasksPerRound: number;
  judgeMode: string;
  nPolicies: number;
  status: 'pending' | 'running' | 'done' | 'failed';
  summaryJson: string | null;
  winner: string | null;
  personaBreakdownJson: string | null;
  airflowDagRunId: string | null;
  mlflowRunId: string | null;
  createdAt: string;
  finishedAt: string | null;
 }
 export interface SimStartRequest {
  nUsers?: number;
  nRounds?: number;
  tasksPerRound?: number;
  judgeMode?: 'rule' | 'llm';
  policies?: string[];
 }
 export function startSimulation(req: SimStartRequest) {
  return apiFetch<{ id: string; status: string; airflow_dag_run_id?: string }>(
    '/admin/simulate/start',
    { method: 'POST', body: JSON.stringify(req) },
  );
 }
 export function getSimulationRuns() {
  return apiFetch<{ runs: SimRun[] }>('/admin/simulate/runs');
 }
 export function getSimulationRun(id: string) {
  return apiFetch<{ run: SimRun & { isRunning: boolean }; events: unknown[] }>(
    `/admin/simulate/${id}`,
  );
 }
--- a/apps/admin/tsconfig.tsbuildinfo
+++ b/apps/admin/tsconfig.tsbuildinfo
--- a/docs/adr/0012-egreedy-v2-profile-features.md
+++ b/docs/adr/0012-egreedy-v2-profile-features.md
@@ -1,7 +1,7 @@
 # ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)
-**Status:** Accepted  
+**Status:** Promoted  
-**Date:** 2026-04-25  
+**Date:** 2026-04-25 (accepted) / 2026-04-26 (promoted)  
 **Issue:** #99
 ## Context
@@ -106,3 +106,19 @@ projecting theta without the corresponding `A` matrix cannot be done correctly.
 the D=12 target in the issue spec and complicates the sim comparison. Deferred.
 **In-place v1 promotion without shadow** — violates ADR-0002.
 ## Promotion record (2026-04-26)
 Offline sim (`runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42`):
 | policy | total reward | mean reward | pulls |
 |--------|-------------|-------------|-------|
 | egreedy-v1 | −64.20 | −0.6420 | 100 |
 | egreedy-v2 | −62.90 | −0.6290 | 100 |
 **Gate passed** (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.
 Changes applied:
 - `recommender.ts` `remotePolicy()`: `/score/egreedy` → `/score/egreedy/v2`
 - `recommender.ts` `sendRewardWithRetry()`: `/reward/egreedy` → `/reward/egreedy/v2`, added `profile_features` to payload
 - Shadow entry `egreedy-v2-shadow` left in registry (`active: false`) for rollback.
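
The gate in the promotion record can be re-derived from the table. The sketch below assumes the gate is exactly "candidate mean ≥ baseline mean" as stated above; `meanReward` and `gatePassed` are hypothetical names, while `total_reward` and `n_pulls` mirror the summary fields used by the admin simulate page.

```typescript
type PolicySummary = { total_reward: number; n_pulls: number };

// Mean reward per pull, as reported in the "mean reward" column.
function meanReward(s: PolicySummary): number {
  return s.total_reward / s.n_pulls;
}

// Assumed gate: the candidate's mean must be at least the baseline's mean.
function gatePassed(candidate: PolicySummary, baseline: PolicySummary): boolean {
  return meanReward(candidate) >= meanReward(baseline);
}

// Values copied from the promotion table.
const egreedyV1: PolicySummary = { total_reward: -64.2, n_pulls: 100 };
const egreedyV2: PolicySummary = { total_reward: -62.9, n_pulls: 100 };

const v1Mean = meanReward(egreedyV1).toFixed(4); // "-0.6420"
const v2Mean = meanReward(egreedyV2).toFixed(4); // "-0.6290"
const promoted = gatePassed(egreedyV2, egreedyV1); // true
```

The margin is 0.013 mean reward over 100 pulls on a single seed, so the check confirms the arithmetic of the gate rather than the size of the effect.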
--- a/infra/docker/Dockerfile.admin
+++ b/infra/docker/Dockerfile.admin
@@ -1,21 +1,22 @@
-FROM node:22-alpine AS base
-RUN npm install -g pnpm
-FROM base AS deps
-WORKDIR /app
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
-COPY packages/shared-types/package.json ./packages/shared-types/
-COPY apps/admin/package.json ./apps/admin/
-RUN pnpm install --frozen-lockfile
+# syntax=docker/dockerfile:1.7
+FROM node:22-slim AS base
+RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
+ && rm -rf /var/lib/apt/lists/* \
+ && npm install -g pnpm
+ENV CI=true \
+    PNPM_HOME=/pnpm \
+    PATH=/pnpm:$PATH
+RUN pnpm config set store-dir /pnpm/store
 FROM base AS builder
 WORKDIR /app
-COPY --from=deps /app/node_modules ./node_modules
-COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
-COPY --from=deps /app/apps/admin/node_modules ./apps/admin/node_modules
-COPY tsconfig.base.json ./
-COPY packages/shared-types ./packages/shared-types
-COPY apps/admin ./apps/admin
+COPY pnpm-lock.yaml ./
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
+COPY . .
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
+    pnpm install --frozen-lockfile --offline \
+      --filter @oo/admin... --filter @oo/shared-types
 RUN pnpm --filter @oo/shared-types build
 ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow
 ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow
@@ -24,7 +25,7 @@ ENV NEXT_TELEMETRY_DISABLED=1 \
    NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL
 RUN pnpm --filter @oo/admin build
-FROM node:22-alpine AS runner
+FROM node:22-slim AS runner
 ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080
 WORKDIR /app
 COPY --from=builder /app/apps/admin/.next/standalone ./
--- a/infra/docker/Dockerfile.api
+++ b/infra/docker/Dockerfile.api
@@ -1,32 +1,35 @@
-FROM node:22-alpine AS base
-RUN npm install -g pnpm
-FROM base AS deps
-WORKDIR /app
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
-COPY packages/shared-types/package.json ./packages/shared-types/
-COPY services/api/package.json ./services/api/
-RUN pnpm install --frozen-lockfile
+# syntax=docker/dockerfile:1.7
+FROM node:22-slim AS base
+RUN apt-get update && apt-get install -y --no-install-recommends \
+      python3 make g++ ca-certificates \
+ && rm -rf /var/lib/apt/lists/* \
+ && npm install -g pnpm
+ENV CI=true \
+    PNPM_HOME=/pnpm \
+    PATH=/pnpm:$PATH
+RUN pnpm config set store-dir /pnpm/store
 FROM base AS builder
 WORKDIR /app
-COPY --from=deps /app/node_modules ./node_modules
-COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
-COPY --from=deps /app/services/api/node_modules ./services/api/node_modules
-COPY tsconfig.base.json ./
-COPY packages/shared-types ./packages/shared-types
-COPY services/api ./services/api
+COPY pnpm-lock.yaml ./
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
+COPY . .
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
+    pnpm install --frozen-lockfile --offline \
+      --filter @oo/api... --filter @oo/shared-types
 RUN pnpm --filter @oo/shared-types build
 RUN pnpm --filter @oo/api build
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
+    pnpm --filter @oo/api --prod deploy --legacy /deploy \
+ && cp -r services/api/dist /deploy/dist \
+ && rm -rf /deploy/node_modules/@oo/shared-types/src \
+ && cp -r packages/shared-types/dist /deploy/node_modules/@oo/shared-types/dist
-FROM node:22-alpine AS runner
+FROM node:22-slim AS runner
 WORKDIR /app
-RUN npm install -g pnpm
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
-COPY packages/shared-types/package.json ./packages/shared-types/
-COPY services/api/package.json ./services/api/
-RUN pnpm install --prod --frozen-lockfile
-COPY --from=builder /app/packages/shared-types/dist ./packages/shared-types/dist
-COPY --from=builder /app/services/api/dist ./services/api/dist
-WORKDIR /app/services/api
+ENV NODE_ENV=production
+COPY --from=builder /deploy/package.json ./
+COPY --from=builder /deploy/node_modules ./node_modules
+COPY --from=builder /deploy/dist ./dist
 CMD ["node", "dist/index.js"]
--- a/infra/docker/Dockerfile.ml
+++ b/infra/docker/Dockerfile.ml
@@ -2,5 +2,5 @@ FROM python:3.12-slim
 WORKDIR /app
 COPY ml/serving/requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
-COPY ml/serving/main.py .
+COPY ml/serving/*.py .
 CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
--- a/infra/docker/docker-compose.yml
+++ b/infra/docker/docker-compose.yml
@@ -11,12 +11,18 @@ services:
    env_file: ../../.env.local
    environment:
      NODE_ENV: production
      ML_SERVING_URL: "http://ml-serving:8000"
      MLFLOW_URL: "http://mlflow:5000"
      AIRFLOW_URL: "http://airflow-webserver:8080"
      AIRFLOW_API_USER: "admin"
      AIRFLOW_API_PASSWORD: "${AIRFLOW_ADMIN_PASSWORD:-admin}"
      INTERNAL_API_TOKEN: "${INTERNAL_API_TOKEN:-}"
    volumes:
      - /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo
    ports:
      - "127.0.0.1:3078:3078"
    healthcheck:
-      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3078/health"]
+      test: ["CMD", "node", "-e", "fetch('http://localhost:3078/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"]
      interval: 10s
      timeout: 5s
      retries: 5
@@ -49,6 +55,8 @@ services:
      PORT: "3080"
      HOSTNAME: "0.0.0.0"
      NEXT_PUBLIC_API_URL: ""
      NEXT_PUBLIC_MLFLOW_URL: "/mlflow"
      NEXT_PUBLIC_AIRFLOW_URL: "/airflow"
      INTERNAL_API_URL: "http://api:3078"
    ports:
      - "127.0.0.1:3080:3080"
@@ -133,8 +141,14 @@ services:
      AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
      AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
      AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth"
      _PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
      MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
      MLFLOW_TRACKING_USERNAME: "admin"
      MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
    volumes:
      - ../../ml/pipelines:/opt/airflow/dags:ro
      - ../../ml:/opt/airflow/ml:ro
    ports:
      - "127.0.0.1:8080:8080"
    depends_on:
@@ -155,8 +169,13 @@ services:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
      _PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
      MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
      MLFLOW_TRACKING_USERNAME: "admin"
      MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
    volumes:
      - ../../ml/pipelines:/opt/airflow/dags:ro
      - ../../ml:/opt/airflow/ml:ro
    depends_on:
      airflow-init:
        condition: service_completed_successfully
--- a/ml/README.md
+++ b/ml/README.md
@@ -4,8 +4,8 @@ Python. Owns models, features, training, online scoring.
 | Dir | Role | Phase |
 |---|---|---|
-| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`), called by `recommender` | 1–2 |
+| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`) + JetStream consumers for `signals.>` / `feedback.>`, called by `recommender` | 1–2 |
-| `features/` | context assembler (`context.py`): signals → `PromptContext`; Feast adapter later | 2 |
+| `features/` | context assembler (`context.py`): signals → `PromptContext`; profile-feature schema mirror (`profile_schema.py`); Feast adapter later | 2 |
 | `pipelines/` | batch feature + training DAGs (Prefect/Airflow) | 4 |
 | `registry/` | MLflow-backed model registry integration | 4 |
 | `experiments/` | A/B assignment + multi-armed bandit policies | 4 |
@@ -18,14 +18,24 @@ Python. Owns models, features, training, online scoring.
 - Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew).
 - Shadow deploys before any policy change that affects real users.
-## Profile-feature contract
+## Feature contract
 ### Profile features (batched)
 User-level features (completion rate, preferred hour, tip volume…) are computed
-by the TypeScript recommender and shipped to ml/serving on every `/score` and
+by the TypeScript recommender and shipped to `ml/serving` on every `/score` and
 `/generate` call as `profile_features: dict | None`. The Python mirror in
-`features/profile_schema.py` documents the available names + dtypes — keep it
+`features/profile_schema.py` documents each feature's name, dtype, TTL, source,
-in sync with `services/api/src/profile/registry.ts` (a CI-style test asserts
+and null fallback — keep it in sync with `services/api/src/profile/registry.ts`
-the name sets match). See ADR-0011.
+(a CI-style test asserts names and `ttlSec` values match). See ADR-0011.
 ### Context features (JIT)
 Request-time signals assembled by `features/context.py` (`hour_of_day`,
 `day_of_week`, task list). These are never cached — they are derived from the
 system clock and the live Todoist feed at the moment of the score call.
 `CONTEXT_FEATURES` in `context.py` declares freshness, source, and fallback for
 each field (issue #61).
 ## Prompt registry
--- a/ml/experiments/sim/runner.py
+++ b/ml/experiments/sim/runner.py
@@ -26,6 +26,7 @@ from __future__ import annotations
 import argparse
 import json
 import os
 import random
 import sys
 import time
@@ -40,6 +41,12 @@ from llm_judge import ACTIONS, infer_reward, judge
 from personas import PERSONAS, Persona
 from task_generator import generate_task_pool
 try:
    import mlflow
    _MLFLOW_AVAILABLE = True
 except ImportError:
    _MLFLOW_AVAILABLE = False
 POLICY_SCORE_ENDPOINTS: dict[str, str] = {
    "linucb-v1": "/score",
    "egreedy-v1": "/score/egreedy",
@@ -107,14 +114,30 @@ def _call_reward(
 # ── Standard single-pass runner (rule / llm modes) ─────────────────────────
 def _init_mlflow(mlflow_url: str | None, experiment: str) -> str | None:
    """Configure MLflow tracking; return "ready" on success, or None if MLflow is unavailable or init fails."""
    if not _MLFLOW_AVAILABLE or not mlflow_url:
        return None
    try:
        mlflow.set_tracking_uri(mlflow_url)
        mlflow.set_experiment(experiment)
        return "ready"
    except Exception as e:
        print(f"  [warn] MLflow init failed: {e}", file=sys.stderr)
        return None
 def run_simulation(
    n_users: int, n_rounds: int, tasks_per_round: int,
    ml_url: str, policies: list[str], use_llm: bool, seed: int,
    mlflow_url: str | None = None, mlflow_experiment: str = "bandit_simulation",
 ) -> dict:
    rng = random.Random(seed)
    run_id = str(uuid.uuid4())[:8]
    started_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    _init_mlflow(mlflow_url, mlflow_experiment)
    user_personas = [
        (f"sim-{run_id}-u{i}", PERSONAS[i % len(PERSONAS)])
        for i in range(n_users)
@@ -130,6 +153,26 @@ def run_simulation(
    }
    events: list[dict] = []
    mlflow_run_id: str | None = None
    mlflow_ctx = (
        mlflow.start_run(run_name=run_id)
        if (_MLFLOW_AVAILABLE and mlflow_url)
        else None
    )
    try:
        if mlflow_ctx:
            active = mlflow_ctx.__enter__()
            mlflow_run_id = active.info.run_id
            mlflow.log_params({
                "n_users": n_users,
                "n_rounds": n_rounds,
                "tasks_per_round": tasks_per_round,
                "policies": ",".join(policies),
                "judge": "llm" if use_llm else "rule",
                "seed": seed,
            })
        with httpx.Client(trust_env=False) as client:
            for rnd in range(n_rounds):
                hour = rng.randint(6, 22)
@@ -139,8 +182,6 @@ def run_simulation(
                for user_id, persona in user_personas:
                    seed_tasks = rnd * 997 + abs(hash(user_id)) % 997
                    tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks)
                    # Per-persona profile features for v2 (synthetic for sim — see ADR-0012)
                    profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None
                    for policy in policies:
@@ -179,13 +220,34 @@ def run_simulation(
                    prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0
                    acc[p]["cumulative_rewards"].append(prev + round_rewards[p])
                if mlflow_ctx:
                    for p in policies:
                        mlflow.log_metric(f"{p}_cumulative_reward",
                                          acc[p]["cumulative_rewards"][-1], step=rnd)
                mode = "llm" if use_llm else "rule"
                print(f"  Round {rnd+1:>3}/{n_rounds} [{mode}]  " + "  ".join(
                    f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies
                ))
-    return _build_result(run_id, started_at, policies, acc, events,
+        result = _build_result(run_id, started_at, policies, acc, events,
                               n_users, n_rounds, tasks_per_round, use_llm, seed)
        result["mlflow_run_id"] = mlflow_run_id
        if mlflow_ctx:
            for p, s in result["summary"].items():
                mlflow.log_metrics({
                    f"{p}_total_reward": s["total_reward"],
                    f"{p}_mean_reward": s["mean_reward"],
                    f"{p}_n_pulls": s["n_pulls"],
                })
            mlflow.set_tag("winner", result["winner"])
        return result
    finally:
        if mlflow_ctx:
            # Pass through any in-flight exception so a failed run is not recorded as FINISHED.
            mlflow_ctx.__exit__(*sys.exc_info())
 # ── Claude Code judge — phase 1: score ─────────────────────────────────────
@@ -494,6 +556,9 @@ if __name__ == "__main__":
                        help="Alias for --judge rule (backwards compat)")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--out", default=None)
    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI"),
                        help="MLflow tracking URI (e.g. http://mlflow:5000/mlflow)")
    parser.add_argument("--mlflow-experiment", default="bandit_simulation")
    args = parser.parse_args()
    if args.no_llm:
@@ -534,6 +599,7 @@ if __name__ == "__main__":
            n_users=args.n_users, n_rounds=args.n_rounds,
            tasks_per_round=args.tasks_per_round, ml_url=args.ml_url,
            policies=args.policies, use_llm=use_llm, seed=args.seed,
            mlflow_url=args.mlflow_url, mlflow_experiment=args.mlflow_experiment,
        )
        Path(out_path).write_text(json.dumps(result, indent=2))
        print()
--- a/ml/features/__init__.py
+++ b/ml/features/__init__.py
@@ -1,3 +1,8 @@
-from .context import build_context, PromptContext, TaskSignal
+from .context import build_context, PromptContext, TaskSignal, ContextFeatureSpec, CONTEXT_FEATURES
 from .profile_schema import ProfileFeature, PROFILE_FEATURES, feature_names
-__all__ = ["build_context", "PromptContext", "TaskSignal"]
+__all__ = [
    "build_context", "PromptContext", "TaskSignal",
    "ContextFeatureSpec", "CONTEXT_FEATURES",
    "ProfileFeature", "PROFILE_FEATURES", "feature_names",
 ]
--- a/ml/features/context.py
+++ b/ml/features/context.py
@@ -2,12 +2,56 @@
 Context assembler — converts raw user signals into a PromptContext for LLM tip generation.
 Usage:
-    from ml.features.context import build_context
+    from ml.features.context import build_context, CONTEXT_FEATURES
    ctx = build_context(tasks, hour_of_day=9, day_of_week=2)
 Feature-spec (issue #61):
  All context features are JIT — they are assembled at request time from live
  sources (system clock, caller-supplied task list) rather than read from a
  cached profile store. They carry no TTL because they are never persisted.
 """
 from __future__ import annotations
 from dataclasses import dataclass, field
 from typing import Literal
@dataclass(frozen=True)
 class ContextFeatureSpec:
    name: str
    dtype: Literal["numeric", "categorical", "list"]
    freshness: Literal["jit", "batched"]
    source: str
    fallback: str
    description: str
 CONTEXT_FEATURES: tuple[ContextFeatureSpec, ...] = (
    ContextFeatureSpec(
        name="hour_of_day",
        dtype="numeric",
        freshness="jit",
        source="request",
        fallback="12",
        description="Current hour (0–23), supplied by the caller at score time.",
    ),
    ContextFeatureSpec(
        name="day_of_week",
        dtype="numeric",
        freshness="jit",
        source="request",
        fallback="0",
        description="Day of week (0=Monday … 6=Sunday, as in Python's date.weekday()), supplied by the caller at score time.",
    ),
    ContextFeatureSpec(
        name="tasks",
        dtype="list",
        freshness="jit",
        source="todoist-integration",
        fallback="[]",
        description="User's open tasks fetched live from the Todoist integration at request time.",
    ),
 )
@dataclass
--- a/ml/features/profile_schema.py
+++ b/ml/features/profile_schema.py
@@ -8,6 +8,12 @@ code (ml/serving, eval harnesses, notebooks) knows what fields to expect on
 Update this file whenever you add or rename a feature in the TS registry.
 The accompanying test asserts that names and ttl_sec values stay in sync.
 Feature-spec fields (issue #61):
  freshness — "batched": value cached in profile store, recomputed on TTL/event.
  ttl_sec   — cache lifetime in seconds; mirrors ``ttlSec`` in registry.ts.
  source    — where the value originates.
  fallback  — raw value returned when the feature is unavailable (null stored).
 """
 from __future__ import annotations
@@ -16,6 +22,10 @@ from typing import Literal
 Dtype = Literal["numeric", "categorical"]
 Freshness = Literal["jit", "batched"]
 _HOUR = 3600
 _DAY = 86_400
@dataclass(frozen=True)
@@ -23,28 +33,57 @@ class ProfileFeature:
    name: str
    dtype: Dtype
    description: str
    freshness: Freshness
    ttl_sec: int
    source: str
    fallback: str
 PROFILE_FEATURES: tuple[ProfileFeature, ...] = (
    ProfileFeature(
-        "completion_rate_30d", "numeric",
+        name="completion_rate_30d",
-        'Fraction of tips served in the last 30 days that received a "done" reaction.',
+        dtype="numeric",
        description='Fraction of tips served in the last 30 days that received a "done" reaction.',
        freshness="batched",
        ttl_sec=6 * _HOUR,
        source="profile_store",
        fallback="0.0",
    ),
    ProfileFeature(
-        "dismiss_rate_30d", "numeric",
+        name="dismiss_rate_30d",
-        'Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
+        dtype="numeric",
        description='Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
        freshness="batched",
        ttl_sec=6 * _HOUR,
        source="profile_store",
        fallback="0.0",
    ),
    ProfileFeature(
-        "mean_dwell_ms_30d", "numeric",
+        name="mean_dwell_ms_30d",
-        "Average dwell time (ms between served and reacted) over the last 30 days.",
+        dtype="numeric",
        description="Average dwell time (ms between served and reacted) over the last 30 days.",
        freshness="batched",
        ttl_sec=6 * _HOUR,
        source="profile_store",
        fallback="null — serving normalises to 0.0",
    ),
    ProfileFeature(
-        "preferred_hour", "numeric",
+        name="preferred_hour",
-        'Hour-of-day with the most "done" reactions in the last 30 days (0-23).',
+        dtype="numeric",
        description='Hour-of-day with the most "done" reactions in the last 30 days (0–23).',
        freshness="batched",
        ttl_sec=_DAY,
        source="profile_store",
        fallback="null — serving normalises to 0.5 (neutral alignment)",
    ),
    ProfileFeature(
-        "tip_volume_30d", "numeric",
+        name="tip_volume_30d",
-        "Number of tips served to the user in the last 30 days.",
+        dtype="numeric",
        description="Number of tips served to the user in the last 30 days.",
        freshness="batched",
        ttl_sec=_HOUR,
        source="profile_store",
        fallback="0",
    ),
 )
--- a/ml/features/test_context.py
+++ b/ml/features/test_context.py
@@ -1,7 +1,7 @@
 """Tests for ml/features/context.py"""
 import pytest
 import sys, os; sys.path.insert(0, os.path.dirname(__file__))
-from context import build_context, TaskSignal, PromptContext
+from context import build_context, TaskSignal, PromptContext, CONTEXT_FEATURES
 def test_empty_tasks():
@@ -62,3 +62,30 @@ def test_due_date_none_preserved():
    tasks = [TaskSignal(id="x", content="No due", due_date=None)]
    ctx = build_context(tasks)
    assert ctx.tasks[0]["due_date"] is None
 # ── CONTEXT_FEATURES spec tests (issue #61) ──────────────────────────────────
 def test_context_features_expected_names():
    names = {f.name for f in CONTEXT_FEATURES}
    assert names == {"hour_of_day", "day_of_week", "tasks"}
 def test_context_features_all_jit():
    for f in CONTEXT_FEATURES:
        assert f.freshness == "jit", f"{f.name}: expected freshness='jit', got {f.freshness!r}"
 def test_context_features_source_set():
    for f in CONTEXT_FEATURES:
        assert f.source, f"{f.name}: source must not be empty"
 def test_context_features_fallback_set():
    for f in CONTEXT_FEATURES:
        assert f.fallback, f"{f.name}: fallback must not be empty"
 def test_context_features_no_duplicates():
    names = [f.name for f in CONTEXT_FEATURES]
    assert len(names) == len(set(names)), f"duplicate names: {names}"
--- a/ml/features/test_profile_schema.py
+++ b/ml/features/test_profile_schema.py
@@ -1,4 +1,4 @@
-"""Smoke test for profile_schema mirror (#81 phase A).
+"""Smoke test for profile_schema mirror (#81 phase A, #61 freshness spec).
 The TS registry in services/api/src/profile/registry.ts is the source of truth.
 This test checks the names listed here match the registry by reading the TS
@@ -14,6 +14,18 @@ from ml.features.profile_schema import PROFILE_FEATURES, feature_names
 REGISTRY_PATH = Path(__file__).resolve().parents[2] / "services" / "api" / "src" / "profile" / "registry.ts"
 _HOUR = 3600
 _DAY = 86_400
 # Expected ttl_sec values mirrored from registry.ts — keeps the two in sync.
 _EXPECTED_TTL: dict[str, int] = {
    "completion_rate_30d": 6 * _HOUR,
    "dismiss_rate_30d":    6 * _HOUR,
    "mean_dwell_ms_30d":   6 * _HOUR,
    "preferred_hour":      _DAY,
    "tip_volume_30d":      _HOUR,
 }
 def _ts_registry_names() -> set[str]:
    text = REGISTRY_PATH.read_text(encoding="utf-8")
@@ -21,6 +33,35 @@ def _ts_registry_names() -> set[str]:
    return set(re.findall(r"name:\s*'([a-zA-Z0-9_]+)'", text))
 def _ts_registry_ttls() -> dict[str, int]:
    """Parse ttlSec values from registry.ts (crude but sufficient for drift detection).
    Handles TS symbolic constants (HOUR, DAY) and expressions like ``6 * HOUR``.
    """
    text = REGISTRY_PATH.read_text(encoding="utf-8")
    # Extract numeric constants: `const HOUR = 3600;` or `const DAY = 86_400;`
    consts: dict[str, int] = {}
    for m in re.finditer(r"const\s+([A-Z_]+)\s*=\s*([\d_]+)", text):
        consts[m.group(1)] = int(m.group(2).replace("_", ""))
    def _eval_expr(expr: str) -> int:
        tokens = [t.strip() for t in expr.split("*")]
        result = 1
        for t in tokens:
            result *= consts[t] if t in consts else int(t)
        return result
    result: dict[str, int] = {}
    for block in re.split(r"\{", text):
        name_m = re.search(r"name:\s*'([a-zA-Z0-9_]+)'", block)
        # ttlSec may be a constant name, a number, or `N * CONST`
        ttl_m = re.search(r"ttlSec:\s*([A-Za-z0-9_]+(?:\s*\*\s*[A-Za-z0-9_]+)?)", block)
        if name_m and ttl_m:
            result[name_m.group(1)] = _eval_expr(ttl_m.group(1))
    return result
 def test_python_mirror_matches_ts_registry():
    py_names = feature_names()
    ts_names = _ts_registry_names()
@@ -39,3 +80,34 @@ def test_profile_schema_no_duplicates():
 def test_profile_schema_dtypes_known():
    for f in PROFILE_FEATURES:
        assert f.dtype in {"numeric", "categorical"}
 def test_all_profile_features_are_batched():
    for f in PROFILE_FEATURES:
        assert f.freshness == "batched", f"{f.name}: expected freshness='batched', got {f.freshness!r}"
 def test_profile_feature_ttl_matches_ts_registry():
    ts_ttls = _ts_registry_ttls()
    for f in PROFILE_FEATURES:
        assert f.name in ts_ttls, f"{f.name} not found in TS registry ttlSec parse"
        assert f.ttl_sec == ts_ttls[f.name], (
            f"{f.name}: Python ttl_sec={f.ttl_sec} != TS ttlSec={ts_ttls[f.name]}"
        )
 def test_profile_feature_ttl_matches_expected():
    for f in PROFILE_FEATURES:
        assert f.ttl_sec == _EXPECTED_TTL[f.name], (
            f"{f.name}: ttl_sec={f.ttl_sec}, expected {_EXPECTED_TTL[f.name]}"
        )
 def test_profile_feature_source_is_profile_store():
    for f in PROFILE_FEATURES:
        assert f.source == "profile_store", f"{f.name}: unexpected source {f.source!r}"
 def test_profile_feature_fallback_set():
    for f in PROFILE_FEATURES:
        assert f.fallback, f"{f.name}: fallback must not be empty"
--- a/ml/pipelines/sim_dag.py
+++ b/ml/pipelines/sim_dag.py
@@ -0,0 +1,124 @@
 """
 Airflow DAG: bandit_sim
 Runs a bandit policy simulation and logs results to MLflow.
 Triggered on-demand from the oO admin panel or manually from the Airflow UI.
 Supported conf keys (passed via dag_run.conf; keys read by the tasks fall back to defaults when omitted):
  sim_run_id      str   — oO SQLite run ID for callback correlation
  n_users         int   — number of synthetic users
  n_rounds        int   — rounds per user
  tasks_per_round int   — candidate pool size per round
  policies        list  — policy names to compare
  judge_mode      str   — "rule" | "llm"
  ml_url          str   — ml/serving URL (e.g. http://ml-serving:8000)
  mlflow_url      str   — MLflow tracking URI (e.g. http://mlflow:5000/mlflow)
  callback_url    str   — oO API callback endpoint
  internal_token  str   — x-internal-token header value
 """
 from __future__ import annotations
 import json
 import os
 import sys
 from datetime import datetime, timedelta
 from airflow import DAG
 from airflow.operators.python import PythonOperator
 def _run_sim(**context: object) -> dict:
    conf: dict = context["dag_run"].conf or {}
    n_users        = int(conf.get("n_users", 5))
    n_rounds       = int(conf.get("n_rounds", 20))
    tasks_per_round = int(conf.get("tasks_per_round", 8))
    policies       = list(conf.get("policies", ["linucb-v1", "egreedy-v1"]))
    judge_mode     = str(conf.get("judge_mode", "rule"))
    ml_url         = str(conf.get("ml_url", "http://ml-serving:8000"))
    mlflow_url     = str(conf.get("mlflow_url", os.environ.get("MLFLOW_TRACKING_URI", "")))
    mlflow_experiment = "bandit_simulation"
    sys.path.insert(0, "/opt/airflow/ml/experiments/sim")
    from runner import run_simulation  # type: ignore[import]
    use_llm = judge_mode == "llm"
    result = run_simulation(
        n_users=n_users,
        n_rounds=n_rounds,
        tasks_per_round=tasks_per_round,
        ml_url=ml_url,
        policies=policies,
        use_llm=use_llm,
        seed=42,
        mlflow_url=mlflow_url or None,
        mlflow_experiment=mlflow_experiment,
    )
    return result
 def _callback(**context: object) -> None:
    import httpx
    conf: dict = context["dag_run"].conf or {}
    callback_url: str = str(conf.get("callback_url", ""))
    internal_token: str = str(conf.get("internal_token", ""))
    if not callback_url or not internal_token:
        print("No callback_url or internal_token — skipping result push.", flush=True)
        return
    result: dict = context["ti"].xcom_pull(task_ids="run_sim")
    if not result:
        print("No result from run_sim task — callback skipped.", flush=True)
        return
    payload = {
        "summary":           result.get("summary", {}),
        "winner":            result.get("winner", ""),
        "persona_breakdown": result.get("persona_breakdown", {}),
        "events":            result.get("events", []),
        "mlflow_run_id":     result.get("mlflow_run_id"),
    }
    try:
        r = httpx.post(
            callback_url,
            json=payload,
            headers={"x-internal-token": internal_token},
            timeout=30.0,
        )
        r.raise_for_status()
        print(f"Callback OK: {r.status_code}", flush=True)
    except Exception as exc:
        print(f"Callback failed: {exc}", flush=True)
        raise
 with DAG(
    dag_id="bandit_sim",
    description="On-demand bandit policy simulation with MLflow tracking",
    schedule_interval=None,
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["bandit", "simulation", "ml"],
    default_args={
        "retries": 1,
        "retry_delay": timedelta(minutes=2),
    },
 ) as dag:
    run_sim = PythonOperator(
        task_id="run_sim",
        python_callable=_run_sim,
    )
    push_results = PythonOperator(
        task_id="push_results",
        python_callable=_callback,
    )
    run_sim >> push_results
--- a/ml/serving/README.md
+++ b/ml/serving/README.md
@@ -0,0 +1,104 @@
 # ml/serving
 FastAPI online scorer, tip generator, and JetStream consumer.
 ## Contract
 | Endpoint | Description |
 |----------|-------------|
 | `POST /score` | LinUCB d=5 (baseline, shadow-eligible) |
 | `POST /score/egreedy` | ε-greedy v1, d=7 (active policy — ADR-0007) |
 | `POST /score/egreedy/v2` | ε-greedy v2, d=12 + profile features (shadow — ADR-0012) |
 | `POST /reward` / `/reward/egreedy` / `/reward/egreedy/v2` | Online reward update per policy |
 | `POST /generate` | LLM tip candidates via LiteLLM `tip-generator` alias |
 | `GET /stats/{user_id}` / `/stats/egreedy/{user_id}` / `/stats/egreedy/v2/{user_id}` | Per-user policy stats |
 | `GET /features/{user_id}` | Last 100 scored feature vectors (ring buffer) |
 | `POST /reset/{user_id}` | Clear all per-user bandit state (admin) |
 | `GET /health` | `{ ok, nats: { enabled, consumers: { signals, feedback } } }` |
 Called by `services/api/src/recommender/` over HTTP. Contract is stable across policy swaps.
 ## Feature dimensions
 | Policy | d | Extra dims vs previous |
 |--------|---|------------------------|
 | LinUCB v1 | 5 | hour_sin/cos, is_overdue, task_age, priority |
 | ε-greedy v1 | 7 | + dow_sin/cos |
 | ε-greedy v2 | 12 | + 5 profile features (ADR-0012) |
 Profile features are computed by the TypeScript API and shipped on each `/score` call as `profile_features`. See `ml/README.md` and ADR-0011.
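
The `hour_sin/cos` and `dow_sin/cos` dimensions in the table above can be illustrated with a cyclic encoding. This is a minimal sketch, assuming the usual unit-circle mapping with periods of 24 hours and 7 days; the exact formula used in `main.py` is not shown here, and `cyclic_pair` / `time_dims` are hypothetical names.

```python
import math


def cyclic_pair(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def time_dims(hour_of_day: int, day_of_week: int) -> list[float]:
    """Return [hour_sin, hour_cos, dow_sin, dow_cos] (assumed layout)."""
    hour_sin, hour_cos = cyclic_pair(hour_of_day, 24)
    dow_sin, dow_cos = cyclic_pair(day_of_week, 7)
    return [hour_sin, hour_cos, dow_sin, dow_cos]


if __name__ == "__main__":
    # 23:00 and 00:00 end up close together, unlike the raw values 23 and 0.
    print(time_dims(23, 6))
    print(time_dims(0, 0))
```

Encoding each periodic value as a pair keeps adjacent hours (and Sunday/Monday) close in feature space, which a single raw integer would not.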
 ## JetStream consumers
 On startup, `nats_consumer.py` registers two durable push consumers against NATS JetStream:
 | Consumer | Stream | Subjects | Durable name |
 |----------|--------|----------|--------------|
 | signals | `signals` | `signals.>` | `feature-pipeline-signals` |
 | feedback | `feedback` | `feedback.>` | `feature-pipeline-feedback` |
 **Handled subjects:**
 - `signals.task.synced` — writes `{last_sync_ts, task_count}` to `{STATE_DIR}/{user}_sync.json`
 - `signals.tip.feedback` — logged for observability; reward update happens via the HTTP path in the recommender
 **Payload validation:** each message is validated against the pydantic models in `schemas.py` (mirroring `packages/shared-types/events/oo/events/v1/`). A `ValidationError` triggers a nak so the message is redelivered rather than silently dropped.
 **Ack semantics:** explicit ack on success; nak for redelivery on error; after `NATS_MAX_DELIVER` delivery attempts the message is dropped (no further redelivery).
 **Disabled** when `NATS_URL` is unset (default in local dev without NATS). No import of `nats-py` occurs in that case.
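
The validate-then-ack flow described above can be sketched without a broker. This is an illustration only: `FakeMsg`, `validate_task_synced`, and `handle` are hypothetical stand-ins (the real consumer uses the pydantic models in `schemas.py` and `nats-py` messages), and the field names `userId`, `syncedAt`, and `count` are taken from the handler in `nats_consumer.py`.

```python
from __future__ import annotations

import asyncio
import json


class FakeMsg:
    """Stand-in for a JetStream message; records whether it was acked or naked."""

    def __init__(self, subject: str, data: bytes) -> None:
        self.subject = subject
        self.data = data
        self.outcome: str | None = None

    async def ack(self) -> None:
        self.outcome = "ack"

    async def nak(self) -> None:
        self.outcome = "nak"


def validate_task_synced(payload: dict) -> dict:
    """Simplified validation (assumption): require userId, syncedAt and an int count."""
    if not isinstance(payload.get("userId"), str):
        raise ValueError("userId must be a string")
    if "syncedAt" not in payload:
        raise ValueError("syncedAt is required")
    if not isinstance(payload.get("count"), int):
        raise ValueError("count must be an integer")
    return payload


async def handle(msg: FakeMsg, health: dict) -> None:
    """Ack on success; nak on any error so the broker can redeliver."""
    try:
        payload = json.loads(msg.data)
        if msg.subject == "signals.task.synced":
            validate_task_synced(payload)
        await msg.ack()
        health["processed"] += 1
    except Exception:
        health["errors"] += 1
        await msg.nak()


if __name__ == "__main__":
    health = {"processed": 0, "errors": 0}
    msg = FakeMsg("signals.task.synced", b'{"userId": "u1"}')
    asyncio.run(handle(msg, health))
    print(msg.outcome, health)
```

Counting errors separately from processed messages mirrors the `processed` / `errors` fields exposed on `/health`.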
 ## Observability
 Logs are structured JSON via **structlog**. Every line includes `level`, `logger`, `timestamp`, and — when a W3C `traceparent` header is present on the incoming request — `trace_id` bound via Python `contextvars`, so all log lines within a request carry the same trace ID as the upstream API call.
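
As a rough illustration of how a `trace_id` can be pulled out of a `traceparent` header, the sketch below applies the same shape check the tracing middleware in `main.py` uses (four dash-separated parts, a 32-character trace ID). `extract_trace_id` is a hypothetical helper name, and the check is deliberately loose: it does not verify that the ID is hexadecimal.

```python
from __future__ import annotations


def extract_trace_id(traceparent: str) -> str | None:
    """Return the trace-id field of a W3C traceparent header, or None if malformed."""
    # Expected layout: version-traceid-parentid-flags
    parts = traceparent.split("-")
    if len(parts) == 4 and len(parts[1]) == 32:
        return parts[1]
    return None


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(extract_trace_id(header))
    print(extract_trace_id("garbage"))
```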
 Sentry error capture is active when `SENTRY_DSN` is set.
 ## Config
 | Env var | Default | Description |
 |---------|---------|-------------|
 | `STATE_DIR` | `/tmp/oo-bandit-state` | Directory for per-user bandit state JSON files |
 | `LITELLM_URL` | `http://localhost:4000` | LiteLLM gateway |
 | `LITELLM_MASTER_KEY` | `sk-oo-dev` | LiteLLM auth key |
 | `NATS_URL` | `` | NATS broker URL; empty = consumers disabled |
 | `NATS_DURABLE_PREFIX` | `feature-pipeline` | Prefix for durable consumer names |
 | `NATS_MAX_DELIVER` | `5` | Max redelivery attempts before dropping |
 | `DEFAULT_PROMPT_VERSION` | `v1` | Fallback prompt version for `/generate` |
 | `ENV` | `development` | Environment label (passed to Sentry) |
 | `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
 ## Health story
 `GET /health` returns `{ ok: true }` plus NATS consumer state:
 ```json
 {
  "ok": true,
  "nats": {
    "enabled": true,
    "consumers": {
      "signals": { "last_msg_ts": "2026-04-25T10:00:00Z", "processed": 42, "errors": 0 },
      "feedback": { "last_msg_ts": null, "processed": 0, "errors": 0 }
    }
  }
 }
 ```
 `last_msg_ts` is `null` until the first message arrives. Used by docker-compose healthcheck.
 ## Extraction criteria
 `ml/serving` already runs as its own process. Extract it to a dedicated host / GPU node when:
 - p99 scoring latency exceeds 50 ms under load, **or**
 - model weights are too large to share memory with the Python process on the current host.
 ## State
 Per-user bandit state is stored as JSON files in `STATE_DIR`:
 | File pattern | Policy |
 |---|---|
 | `{user}.json` | LinUCB v1 |
 | `{user}_egreedy.json` | ε-greedy v1 |
 | `{user}_egreedy_v2.json` | ε-greedy v2 |
 | `{user}_sync.json` | Last task sync metadata (written by JetStream consumer) |
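
The file patterns in the table above can be expressed as a small path helper. This is a sketch under assumptions: `state_files` is a hypothetical name, and the user-ID sanitisation is copied from `_sync_meta_path` in `nats_consumer.py`; whether the bandit state files sanitise the ID the same way is not confirmed here.

```python
from pathlib import Path

# Suffix per state file, following the table above.
SUFFIXES = {
    "linucb_v1": "",
    "egreedy_v1": "_egreedy",
    "egreedy_v2": "_egreedy_v2",
    "sync": "_sync",
}


def state_files(state_dir: Path, user_id: str) -> dict[str, Path]:
    """Return the expected per-user state file paths under state_dir."""
    # Replace anything that is not alphanumeric, as _sync_meta_path does.
    safe = "".join(c if c.isalnum() else "_" for c in user_id)
    return {key: state_dir / f"{safe}{suffix}.json" for key, suffix in SUFFIXES.items()}


if __name__ == "__main__":
    for key, path in state_files(Path("/tmp/oo-bandit-state"), "user-42").items():
        print(key, path)
```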
--- a/ml/serving/logging_config.py
+++ b/ml/serving/logging_config.py
@@ -0,0 +1,20 @@
 """Structlog JSON configuration — import once at process start."""
 import logging
 import structlog
 def configure() -> None:
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
        context_class=dict,
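        # add_logger_name reads logger.name, which PrintLogger lacks; use stdlib loggers instead.
        logger_factory=structlog.stdlib.LoggerFactory(),
    )
    # Emit the pre-rendered JSON line unchanged; INFO must pass the stdlib level filter.
    logging.basicConfig(format="%(message)s", level=logging.INFO)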
--- a/ml/serving/main.py
+++ b/ml/serving/main.py
@@ -28,17 +28,55 @@ import math
 import os
 import time
 from collections import deque
 from contextlib import asynccontextmanager
 from pathlib import Path
 from typing import Optional, Deque
 import httpx
 import numpy as np
-from fastapi import FastAPI, HTTPException
+import sentry_sdk
 import structlog
 import structlog.contextvars
 from fastapi import FastAPI, HTTPException, Request
 from pydantic import BaseModel
 from starlette.middleware.base import BaseHTTPMiddleware
 import logging_config
 import nats_consumer
 from prompts import get_prompt
-app = FastAPI(title="oO ML Serving", version="1.0.0")
+logging_config.configure()
 _SENTRY_DSN = os.getenv("SENTRY_DSN")
 if _SENTRY_DSN:
    sentry_sdk.init(dsn=_SENTRY_DSN, environment=os.getenv("ENV", "development"))
 log = structlog.get_logger()
@asynccontextmanager
 async def lifespan(app: FastAPI):
    await nats_consumer.start(STATE_DIR)
    yield
    await nats_consumer.stop()
 app = FastAPI(title="oO ML Serving", version="1.0.0", lifespan=lifespan)
 class _TracingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        structlog.contextvars.clear_contextvars()
        traceparent = request.headers.get("traceparent", "")
        if traceparent:
            parts = traceparent.split("-")
            trace_id = parts[1] if len(parts) == 4 and len(parts[1]) == 32 else None
            if trace_id:
                structlog.contextvars.bind_contextvars(trace_id=trace_id)
        return await call_next(request)
 app.add_middleware(_TracingMiddleware)
 LITELLM_URL = os.getenv("LITELLM_URL", "http://localhost:4000")
 LITELLM_MASTER_KEY = os.getenv("LITELLM_MASTER_KEY", "sk-oo-dev")
@@ -315,7 +353,13 @@ class GenerateResponse(BaseModel):
@app.get("/health")
 def health():
-    return {"ok": True}
+    return {
        "ok": True,
        "nats": {
            "enabled": bool(nats_consumer.NATS_URL),
            "consumers": nats_consumer.consumer_health,
        },
    }
 _RETRY_SUFFIX = (
--- a/ml/serving/nats_consumer.py
+++ b/ml/serving/nats_consumer.py
@@ -0,0 +1,146 @@
 """
 JetStream durable consumers for ml/serving.
 Streams:
  signals  (subjects: signals.>) — durable: {prefix}-signals
  feedback (subjects: feedback.>) — durable: {prefix}-feedback
 Handled subjects:
  signals.task.synced   → write per-user sync metadata to STATE_DIR
  signals.tip.feedback  → log for observability (reward is applied via HTTP path)
 Config (env vars):
  NATS_URL            — broker URL; empty = consumers disabled (default: "")
  NATS_DURABLE_PREFIX — prefix for durable consumer names (default: "feature-pipeline")
  NATS_MAX_DELIVER    — max redelivery attempts before dropping (default: 5)
 """
 from __future__ import annotations
 import json
 import os
 import time
 from pathlib import Path
 from typing import Optional
 import structlog
 from schemas import TaskSyncedPayload, TipFeedbackPayload
 log = structlog.get_logger(__name__)
 NATS_URL = os.getenv("NATS_URL", "")
 NATS_DURABLE_PREFIX = os.getenv("NATS_DURABLE_PREFIX", "feature-pipeline")
 NATS_MAX_DELIVER = int(os.getenv("NATS_MAX_DELIVER", "5"))
 # Exposed to /health
 consumer_health: dict[str, dict] = {
    "signals": {"last_msg_ts": None, "processed": 0, "errors": 0},
    "feedback": {"last_msg_ts": None, "processed": 0, "errors": 0},
 }
 _nc = None          # nats.aio.Client
 _subs: list = []    # active JetStream subscriptions
 # ── Subject handlers ───────────────────────────────────────────────────────
 def _sync_meta_path(state_dir: Path, user_id: str) -> Path:
    safe = "".join(c if c.isalnum() else "_" for c in user_id)
    return state_dir / f"{safe}_sync.json"
 async def _handle(subject: str, payload: dict, state_dir: Path) -> None:
    if subject == "signals.task.synced":
        msg = TaskSyncedPayload.model_validate(payload)
        p = _sync_meta_path(state_dir, msg.userId)
        p.write_text(json.dumps({
            "last_sync_ts": msg.syncedAt,
            "task_count": msg.count,
        }))
        log.info("nats: task_synced", user_id=msg.userId, count=msg.count)
    elif subject == "signals.tip.feedback":
        msg = TipFeedbackPayload.model_validate(payload)
        log.info("nats: tip_feedback", user_id=msg.userId, tip_id=msg.tipId, action=msg.action, reward=msg.reward)
    else:
        log.debug("nats: unhandled subject", subject=subject)
 # ── Consumer factory ───────────────────────────────────────────────────────
 def _make_handler(key: str, state_dir: Path):
    """Return an async push-consumer callback that acks on success, naks on error."""
    async def handler(msg) -> None:
        consumer_health[key]["last_msg_ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        try:
            payload = json.loads(msg.data)
            await _handle(msg.subject, payload, state_dir)
            await msg.ack()
            consumer_health[key]["processed"] += 1
        except Exception as exc:
            consumer_health[key]["errors"] += 1
            log.warning("nats: processing error", key=key, subject=msg.subject, exc=str(exc))
            await msg.nak()
    return handler
 # ── Lifecycle ──────────────────────────────────────────────────────────────
 async def start(state_dir: Path) -> None:
    """Connect to NATS and register durable push consumers. No-op if NATS_URL is unset."""
    global _nc
    if not NATS_URL:
        log.info("nats: NATS_URL unset — JetStream consumers disabled")
        return
    try:
        import nats as nats_lib
        from nats.js.api import ConsumerConfig, AckPolicy
        _nc = await nats_lib.connect(
            NATS_URL,
            name="ml-serving",
            reconnect_time_wait=5,
            max_reconnect_attempts=-1,
        )
        js = _nc.jetstream()
        log.info("nats: connected", url=NATS_URL)
    except Exception as exc:
        log.warning("nats: connection failed — consumers disabled", exc=str(exc))
        _nc = None
        return
    config = ConsumerConfig(
        ack_policy=AckPolicy.EXPLICIT,
        max_deliver=NATS_MAX_DELIVER,
    )
    for key, subject in [("signals", "signals.>"), ("feedback", "feedback.>")]:
        durable = f"{NATS_DURABLE_PREFIX}-{key}"
        try:
            sub = await js.subscribe(
                subject,
                durable=durable,
                cb=_make_handler(key, state_dir),
                config=config,
            )
            _subs.append(sub)
            log.info("nats: subscribed", subject=subject, durable=durable)
        except Exception as exc:
            log.warning("nats: subscribe failed", key=key, exc=str(exc))
 async def stop() -> None:
    """Drain subscriptions and close NATS connection."""
    global _nc
    for sub in _subs:
        try:
            await sub.unsubscribe()
        except Exception:
            pass
    _subs.clear()
    if _nc:
        try:
            await _nc.drain()
        except Exception:
            pass
        _nc = None
        log.info("nats: disconnected")
--- a/ml/serving/requirements.txt
+++ b/ml/serving/requirements.txt
@@ -4,3 +4,6 @@ pydantic==2.10.4
 numpy>=1.26.0
 httpx>=0.27.0
 anthropic>=0.40.0
 nats-py>=2.9.0
 structlog>=24.1.0
 sentry-sdk>=2.0.0
--- a/ml/serving/schemas.py
+++ b/ml/serving/schemas.py
@@ -0,0 +1,50 @@
 """
 Pydantic models mirroring oo.events.v1 proto schemas.
 Field names use camelCase to match the proto3 JSON mapping convention
 and the TypeScript payload shapes published by services/api.
 Keep in sync with packages/shared-types/events/oo/events/v1/.
 """
 from __future__ import annotations
 from typing import Literal, Optional
 from pydantic import BaseModel
 class TaskSyncedPayload(BaseModel):
    userId: str
    source: str
    count: int
    syncedAt: str
 class TipServedPayload(BaseModel):
    userId: str
    tipId: str
    policy: str
    servedAt: str
 class TipFeedbackPayload(BaseModel):
    userId: str
    tipId: str
    action: Literal['done', 'dismiss', 'snooze', 'helpful', 'not_helpful']
    reward: float
    dwellMs: Optional[int] = None
    createdAt: str
 class TipRewardFailedPayload(BaseModel):
    userId: str
    tipId: str
    reward: float
    attempts: int
    error: str
    failedAt: str
 class IntegrationTokenExpiredPayload(BaseModel):
    userId: str
    provider: str
    detectedAt: str
--- a/ml/serving/tests/test_schemas_and_consumer.py
+++ b/ml/serving/tests/test_schemas_and_consumer.py
@@ -0,0 +1,169 @@
 """
 Tests for schemas.py and nats_consumer._handle.
 """
 import json
 import pytest
 import tempfile
 from pathlib import Path
 from pydantic import ValidationError
 from unittest.mock import AsyncMock
 from schemas import (
    TaskSyncedPayload,
    TipServedPayload,
    TipFeedbackPayload,
    TipRewardFailedPayload,
    IntegrationTokenExpiredPayload,
 )
 from nats_consumer import _handle, _sync_meta_path
 # ── Schema validation ─────────────────────────────────────────────────────────
 class TestTaskSyncedPayload:
    def test_valid(self):
        p = TaskSyncedPayload.model_validate(
            {"userId": "u1", "source": "todoist", "count": 5, "syncedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.userId == "u1"
        assert p.count == 5
    def test_missing_field_raises(self):
        with pytest.raises(ValidationError):
            TaskSyncedPayload.model_validate({"userId": "u1", "source": "todoist"})
    def test_wrong_type_raises(self):
        with pytest.raises(ValidationError):
            TaskSyncedPayload.model_validate(
                {"userId": "u1", "source": "todoist", "count": "not-an-int", "syncedAt": "2026-04-25T10:00:00Z"}
            )
 class TestTipFeedbackPayload:
    def test_valid_without_dwell(self):
        p = TipFeedbackPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "action": "done", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
        )
        assert p.dwellMs is None
    def test_valid_with_dwell(self):
        p = TipFeedbackPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "action": "helpful", "reward": 0.5,
             "dwellMs": 3200, "createdAt": "2026-04-25T10:00:00Z"}
        )
        assert p.dwellMs == 3200
    def test_invalid_action_raises(self):
        with pytest.raises(ValidationError):
            TipFeedbackPayload.model_validate(
                {"userId": "u1", "tipId": "t1", "action": "like", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
            )
    def test_all_valid_actions(self):
        for action in ("done", "dismiss", "snooze", "helpful", "not_helpful"):
            p = TipFeedbackPayload.model_validate(
                {"userId": "u1", "tipId": "t1", "action": action, "reward": 0.0, "createdAt": "2026-04-25T10:00:00Z"}
            )
            assert p.action == action
 class TestOtherPayloads:
    def test_tip_served(self):
        p = TipServedPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "policy": "egreedy-v2", "servedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.policy == "egreedy-v2"
    def test_tip_reward_failed(self):
        p = TipRewardFailedPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "reward": 1.0, "attempts": 3,
             "error": "timeout", "failedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.attempts == 3
    def test_integration_token_expired(self):
        p = IntegrationTokenExpiredPayload.model_validate(
            {"userId": "u1", "provider": "todoist", "detectedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.provider == "todoist"
 # ── _handle behaviour ─────────────────────────────────────────────────────────
 TASK_SYNCED = {
    "userId": "user-abc",
    "source": "todoist",
    "count": 7,
    "syncedAt": "2026-04-25T10:00:00Z",
 }
 TIP_FEEDBACK = {
    "userId": "user-abc",
    "tipId": "tip-xyz",
    "action": "done",
    "reward": 1.0,
    "dwellMs": 4200,
    "createdAt": "2026-04-25T10:00:00Z",
 }
 class TestHandle:
    @pytest.mark.asyncio
    async def test_task_synced_writes_meta_file(self):
        with tempfile.TemporaryDirectory() as tmp:
            state_dir = Path(tmp)
            await _handle("signals.task.synced", TASK_SYNCED, state_dir)
            meta_path = _sync_meta_path(state_dir, "user-abc")
            assert meta_path.exists()
            data = json.loads(meta_path.read_text())
            assert data["task_count"] == 7
            assert data["last_sync_ts"] == "2026-04-25T10:00:00Z"
    @pytest.mark.asyncio
    async def test_task_synced_bad_payload_raises(self):
        with tempfile.TemporaryDirectory() as tmp:
            with pytest.raises(ValidationError):
                await _handle("signals.task.synced", {"userId": "u1"}, Path(tmp))
    @pytest.mark.asyncio
    async def test_tip_feedback_valid_does_not_raise(self):
        with tempfile.TemporaryDirectory() as tmp:
            # should log and return cleanly
            await _handle("signals.tip.feedback", TIP_FEEDBACK, Path(tmp))
    @pytest.mark.asyncio
    async def test_tip_feedback_bad_action_raises(self):
        bad = {**TIP_FEEDBACK, "action": "unknown"}
        with tempfile.TemporaryDirectory() as tmp:
            with pytest.raises(ValidationError):
                await _handle("signals.tip.feedback", bad, Path(tmp))
    @pytest.mark.asyncio
    async def test_unhandled_subject_is_ignored(self):
        with tempfile.TemporaryDirectory() as tmp:
            # should not raise for unknown subjects
            await _handle("signals.something.new", {"any": "data"}, Path(tmp))
    @pytest.mark.asyncio
    async def test_make_handler_acks_on_success(self):
        from nats_consumer import _make_handler
        with tempfile.TemporaryDirectory() as tmp:
            handler = _make_handler("signals", Path(tmp))
            msg = AsyncMock()
            msg.subject = "signals.task.synced"
            msg.data = json.dumps(TASK_SYNCED).encode()
            await handler(msg)
            msg.ack.assert_awaited_once()
            msg.nak.assert_not_awaited()
    @pytest.mark.asyncio
    async def test_make_handler_naks_on_validation_error(self):
        from nats_consumer import _make_handler
        with tempfile.TemporaryDirectory() as tmp:
            handler = _make_handler("signals", Path(tmp))
            msg = AsyncMock()
            msg.subject = "signals.task.synced"
            msg.data = json.dumps({"userId": "u1"}).encode()  # missing fields
            await handler(msg)
            msg.nak.assert_awaited_once()
            msg.ack.assert_not_awaited()
--- a/packages/shared-types/README.md
+++ b/packages/shared-types/README.md
@@ -0,0 +1,63 @@
 # @oo/shared-types
 Canonical contracts for all inter-module communication. Two surfaces:
 | Surface | Format | Location |
 |---------|--------|----------|
 | HTTP (sync) | OpenAPI / TypeScript interfaces | `src/http/` |
 | Events (async) | Protocol Buffers + TS interfaces | `src/events/`, `events/` |
 ## HTTP types
 Hand-written TypeScript interfaces that mirror the OpenAPI specs. Imported by
 `services/api` and `apps/web`; `ml/serving` keeps hand-written Python mirrors.
 | File | Types |
 |------|-------|
 | `src/http/tip.ts` | `TipCandidate`, `RecommendResponse`, `TipFeedback` |
 | `src/http/auth.ts` | `SessionUser` |
 | `src/http/integrations.ts` | `IntegrationsResponse`, `Integration` |
 | `src/http/user.ts` | `UserProfile` |
 | `src/http/signal.ts` | `Signal`, `SignalSource` |
 ## Event types
 Protobuf schemas live in `events/oo/events/v1/`. TypeScript interfaces in
 `src/events/index.ts` mirror the proto envelope and payload types.
 | Proto file | Messages |
 |------------|----------|
 | `envelope.proto` | `Envelope` (wraps every event) |
 | `signals.proto` | `TaskSyncedPayload`, `TipServedPayload`, `TipFeedbackPayload`, `TipRewardFailedPayload` |
 | `integration.proto` | `IntegrationTokenExpiredPayload` |
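
Events travel as proto3 JSON, so each proto field name appears on the wire in lowerCamelCase (`synced_at` becomes `syncedAt`), which is why the TypeScript and Python mirrors use camelCase. A minimal sketch of that name mapping follows. `to_json_name` is a hypothetical helper for illustration, not part of this package, and it only covers plain snake_case names.

```python
def to_json_name(proto_field: str) -> str:
    """Convert a snake_case proto field name to its lowerCamelCase JSON name."""
    head, *rest = proto_field.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


# Field names taken from TaskSyncedPayload in signals.proto.
proto_fields = ["user_id", "source", "count", "synced_at"]
json_fields = [to_json_name(f) for f in proto_fields]
print(json_fields)
```

A mirror that spells a key differently (for example `synced_at` in a Python model) would likely fail validation against payloads published by `services/api`.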
 **Schema evolution rules (ADR-0005):**
 - Additive changes only within a version (new fields, new message types).
 - Removed fields must be marked `reserved` — never reuse a field number.
 - Breaking changes require a new package version (`oo.events.v2`) and a `schemaVersion` bump in the envelope.
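
The rules above can be illustrated with a small checker. This is only a sketch of the idea, not how `buf breaking` is implemented. `check_evolution` and its inputs (a map of field number to field name, plus a set of reserved numbers) are assumptions made for the example.

```python
def check_evolution(old: dict, new: dict, reserved: set) -> list:
    """Return human-readable violations of the additive-only rules.

    old/new map field number -> field name for one message.
    """
    problems = []
    for number, name in old.items():
        if number not in new:
            # A removed field is acceptable only if its number is reserved.
            if number not in reserved:
                problems.append(f"field {number} ({name}) removed but not reserved")
        elif new[number] != name:
            problems.append(f"field {number} renamed or reused: {name} -> {new[number]}")
    for number in new:
        if number in reserved:
            problems.append(f"field {number} is reserved and must not be used")
    return problems


# TipServedPayload fields from signals.proto, then two hypothetical revisions.
v1 = {1: "user_id", 2: "tip_id", 3: "policy", 4: "served_at"}
additive = {**v1, 5: "surface"}  # new field only: allowed
repurposed = {1: "user_id", 2: "tip_id", 3: "variant", 4: "served_at"}

print(check_evolution(v1, additive, reserved=set()))
print(check_evolution(v1, repurposed, reserved=set()))
```

The first call reports no problems; the second flags field 3, the kind of change that would belong in `oo.events.v2` instead.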
 ## Schema registry / CI gate
 `buf` enforces lint and breaking-change detection on every PR that touches `events/`:
 ```bash
 # Lint
 buf lint events/
 # Breaking-change check against main
 buf breaking events/ --against '.git#branch=main,subdir=packages/shared-types/events'
 ```
 Local shortcut: `./scripts/buf-check.sh`
 CI: `.gitea/workflows/buf-check.yaml` (requires a Gitea Actions runner).
 Install buf: `curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf`
 ## Contract
 `/health` — not applicable (library package, no process).
 **Extraction criteria** — always a shared library. Extract to a separate registry
 service only when schema governance requires independent versioning and deployment
 (e.g. external consumers, SLA divergence from the monorepo).
--- a/packages/shared-types/events/buf.yaml
+++ b/packages/shared-types/events/buf.yaml
@@ -0,0 +1,7 @@
 version: v1
 lint:
  use:
    - STANDARD
 breaking:
  use:
    - FILE
--- a/packages/shared-types/events/oo/events/v1/envelope.proto
+++ b/packages/shared-types/events/oo/events/v1/envelope.proto
@@ -0,0 +1,25 @@
 syntax = "proto3";
 package oo.events.v1;
 import "oo/events/v1/signals.proto";
 import "oo/events/v1/integration.proto";
 // Envelope wraps every event on the bus and on NATS JetStream.
 // Wire format: proto3 JSON (camelCase field names).
 // schema_version = "v1" — bump to "v2" only for breaking payload changes.
 message Envelope {
  string event_id       = 1;  // UUID assigned by bus on publish
  string occurred_at    = 2;  // ISO 8601
  string schema_version = 3;  // "v1"
  string producer       = 4;  // e.g. "services/api"
  string subject        = 5;  // NATS-style subject: domain.entity.verb
  uint64 seq            = 6;  // monotonic sequence from the bus ring
  oneof payload {
    TaskSyncedPayload              task_synced              = 10;
    TipServedPayload               tip_served               = 11;
    TipFeedbackPayload             tip_feedback             = 12;
    TipRewardFailedPayload         tip_reward_failed        = 13;
    IntegrationTokenExpiredPayload integration_token_expired = 14;
  }
 }
--- a/packages/shared-types/events/oo/events/v1/integration.proto
+++ b/packages/shared-types/events/oo/events/v1/integration.proto
@@ -0,0 +1,9 @@
 syntax = "proto3";
 package oo.events.v1;
 // subject: signals.integration.token_expired
 message IntegrationTokenExpiredPayload {
  string user_id     = 1;
  string provider    = 2;
  string detected_at = 3;  // ISO 8601
 }
--- a/packages/shared-types/events/oo/events/v1/signals.proto
+++ b/packages/shared-types/events/oo/events/v1/signals.proto
@@ -0,0 +1,39 @@
 syntax = "proto3";
 package oo.events.v1;
 // subject: signals.task.synced
 message TaskSyncedPayload {
  string user_id   = 1;
  string source    = 2;  // e.g. "todoist"
  int32  count     = 3;
  string synced_at = 4;  // ISO 8601
 }
 // subject: signals.tip.served
 message TipServedPayload {
  string user_id   = 1;
  string tip_id    = 2;
  string policy    = 3;
  string served_at = 4;  // ISO 8601
 }
 // subject: signals.tip.feedback
 // action: done | dismiss | snooze | helpful | not_helpful
 message TipFeedbackPayload {
  string         user_id    = 1;
  string         tip_id     = 2;
  string         action     = 3;
  double         reward     = 4;
  optional int64 dwell_ms   = 5;  // unset when no dwell was recorded
  string         created_at = 6;  // ISO 8601
 }
 // subject: signals.tip.reward_failed
 message TipRewardFailedPayload {
  string user_id   = 1;
  string tip_id    = 2;
  double reward    = 3;
  int32  attempts  = 4;
  string error     = 5;
  string failed_at = 6;  // ISO 8601
 }
--- a/packages/shared-types/package.json
+++ b/packages/shared-types/package.json
@@ -15,7 +15,9 @@
    "test": "vitest run",
    "test:watch": "vitest",
    "type-check": "tsc --noEmit",
-    "clean": "rm -rf dist"
+    "clean": "rm -rf dist",
    "buf:lint": "buf lint events",
    "buf:breaking": "buf breaking events --against '.git#branch=main,subdir=packages/shared-types/events'"
  },
  "devDependencies": {
    "@vitest/coverage-v8": "^4.1.4",
--- a/packages/shared-types/src/events/index.ts
+++ b/packages/shared-types/src/events/index.ts
@@ -1,6 +1,6 @@
 /**
 * NormalizedEvent — the durable envelope for all events flowing through
- * the system. Today: in-process EventEmitter. Tomorrow: NATS JetStream.
+ * the system. Mirrors oo.events.v1.Envelope in packages/shared-types/events/.
 *
 * Subject taxonomy:
 *   signals.task.synced      — Todoist (or other source) task list refreshed
@@ -10,10 +10,16 @@
 *   signals.integration.token_expired — OAuth token needs reconnect
 */
 export interface NormalizedEvent<T = unknown> {
  /** UUID assigned by bus on publish */
  eventId: string;
  /** NATS-style subject: domain.entity.verb */
  subject: string;
  /** ISO 8601 timestamp */
-  ts: string;
+  occurredAt: string;
  /** "v1" — bump for breaking payload changes; see packages/shared-types/events/ */
  schemaVersion: 'v1';
  /** e.g. "services/api" */
  producer: string;
  /** Monotonically increasing sequence number (in-process ring; JetStream seq in prod) */
  seq: number;
  payload: T;
--- a/packages/shared-types/tsconfig.json
+++ b/packages/shared-types/tsconfig.json
@@ -4,5 +4,6 @@
    "outDir": "dist",
    "rootDir": "src"
  },
-  "include": ["src"]
+  "include": ["src"],
  "exclude": ["src/__tests__", "**/*.test.ts"]
 }
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
--- a/scripts/buf-check.sh
+++ b/scripts/buf-check.sh
@@ -0,0 +1,24 @@
 #!/usr/bin/env bash
 # Run buf lint and breaking-change detection locally.
 # Usage: ./scripts/buf-check.sh [against-branch]
 # Default against-branch: main
 set -euo pipefail
 AGAINST="${1:-main}"
 ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 EVENTS="$ROOT/packages/shared-types/events"
 if ! command -v buf &>/dev/null; then
  echo "buf not found. Install: https://buf.build/docs/installation"
  echo "  curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf"
  exit 1
 fi
 echo "==> buf lint"
 buf lint "$EVENTS"
 echo "==> buf breaking against $AGAINST"
 buf breaking "$EVENTS" \
  --against ".git#branch=${AGAINST},subdir=packages/shared-types/events"
 echo "All checks passed."
--- a/services/api/README.md
+++ b/services/api/README.md
@@ -0,0 +1,91 @@
 # services/api
 Express BFF that serves all client-facing routes, manages sessions, runs background signal sync, and proxies admin calls to `ml/serving`.
 ## Contract
 ```
 GET  /health                             → { ok: true }
 POST /api/auth/login                     → redirect to Google OAuth
 GET  /api/auth/callback                  OAuth return URL
 POST /api/auth/logout
 GET  /api/auth/session                   → { user? }
 POST /api/auth/token                     { token } → set sid cookie (ADMIN_TOKEN auth)
 GET  /api/integrations                   list connected integrations
 POST /api/integrations/todoist/connect   start Todoist OAuth
 GET  /api/integrations/todoist/callback
 DELETE /api/integrations/:provider       disconnect
 POST /api/recommend                      → { tip }
 POST /api/tip/:id/feedback               { action } → { ok }
 GET  /api/user/profile
 DELETE /api/user                         account deletion
 POST /api/push/subscribe
 DELETE /api/push/subscribe
 GET  /api/admin/stats                    DAU/WAU, feedback breakdown
 GET  /api/admin/users
 GET  /api/admin/events                   recent event stream (ring buffer)
 GET  /api/admin/sim/runs                 offline sim run list
 POST /api/admin/sim/run                  launch offline sim
 GET  /api/admin/sim/runs/:id/output      tail sim stdout
 ...
 GET  /api/ml/*                           admin-only proxy to ml/serving
 ```
 ## Middleware stack (request order)
 1. `cors` — origin limited to `WEB_BASE_URL`
 2. `tracingMiddleware` — reads or generates W3C `traceparent`; sets `req.traceId` + `req.traceparent`
 3. `pinoHttp` — structured JSON request/response logs with `traceId` field; `/health` suppressed
 4. `express.json()` / `cookieParser`
 5. `sessionMiddleware` — validates `sid` cookie, attaches `req.userId`
 ## Observability
 Logs are structured JSON via **pino**. Every line includes `traceId` (extracted from the incoming W3C `traceparent` header, or generated fresh). The same `traceparent` is forwarded on all outbound HTTP calls to `ml/serving` so traces correlate end-to-end.
 Sentry error capture is active when `SENTRY_DSN` is set.
 ## Background tasks
 - **Todoist sync scheduler** — runs every `TODOIST_SYNC_INTERVAL_MS` (default 15 min); starts 10 s after boot to avoid startup surge.
 - **Retention purge** — deletes `tipScores` and `tipFeedback` rows older than 30 days; runs on boot and daily.
 - **Profile TTL invalidation** — listens to `signals.task.synced` and `signals.tip.feedback` on the in-process Bus; invalidates cached user-level profile features so the next `/recommend` gets fresh values.
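
As a rough illustration of the retention purge, the cutoff is "now minus 30 days", and any row with an older timestamp is eligible for deletion. The sketch below assumes ISO 8601 UTC timestamps stored as text. `retention_cutoff`, `is_expired`, and the sample timestamps are hypothetical and not taken from the service.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30


def retention_cutoff(now: datetime) -> datetime:
    """Oldest timestamp that is still kept."""
    return now - timedelta(days=RETENTION_DAYS)


def is_expired(created_at: str, now: datetime) -> bool:
    # Replace a trailing "Z" so fromisoformat also works before Python 3.11.
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return created < retention_cutoff(now)


now = datetime(2026, 4, 26, 12, 0, tzinfo=timezone.utc)
rows = ["2026-03-01T00:00:00Z", "2026-04-25T10:00:00Z"]
expired = [ts for ts in rows if is_expired(ts, now)]
print(expired)
```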
 ## Config
 | Env var | Default | Description |
 |---------|---------|-------------|
 | `PORT` | `3001` | Listen port |
 | `NODE_ENV` | `development` | Environment label |
 | `DATABASE_PATH` | `./data/oo.db` | SQLite file |
 | `SESSION_SECRET` | required | Cookie signing secret |
 | `GOOGLE_CLIENT_ID/SECRET` | required | OAuth |
 | `TODOIST_CLIENT_ID/SECRET` | required | OAuth |
 | `API_BASE_URL` | `http://localhost:3001` | Self-referential redirect URI |
 | `WEB_BASE_URL` | `http://localhost:3000` | CORS + post-login redirect |
 | `ML_SERVING_URL` | `http://localhost:8000` | ml/serving base URL |
 | `NATS_URL` | `` | NATS broker; empty = in-process bus only |
 | `TODOIST_SYNC_INTERVAL_MS` | `900000` | Background sync cadence |
 | `TIP_PROMPT_VERSION` | `` | Prompt variant(s) for `/generate` |
 | `LOG_LEVEL` | `info` | pino log level |
 | `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
 | `VAPID_*` | | Web push keys |
 | `ADMIN_TOKEN` | `` | Static token for service/Playwright admin auth; empty = disabled |
 ## Health story
 `GET /health` returns `{ ok: true }`. No dependency checks — upstream deps (`ml/serving`, NATS) have their own health endpoints checked separately.
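
Because this endpoint reports no dependency state, a caller that wants the NATS picture has to read it from `ml/serving`, whose `/health` payload in this change set includes `nats.enabled` and per-consumer counters. A minimal sketch of interpreting that payload, assuming it has already been fetched and decoded. `summarize_health` and the notion of an "idle" consumer are assumptions for illustration.

```python
def summarize_health(payload: dict) -> dict:
    """Condense an ml/serving /health payload into a few flags."""
    nats = payload.get("nats", {})
    consumers = nats.get("consumers", {})
    return {
        "ok": bool(payload.get("ok")),
        "nats_enabled": bool(nats.get("enabled")),
        # Consumers that have not seen a message since the process started.
        "idle_consumers": sorted(
            key for key, stats in consumers.items() if stats.get("last_msg_ts") is None
        ),
        "total_errors": sum(stats.get("errors", 0) for stats in consumers.values()),
    }


sample = {
    "ok": True,
    "nats": {
        "enabled": True,
        "consumers": {
            "signals": {"last_msg_ts": "2026-04-25T10:00:00Z", "processed": 12, "errors": 1},
            "feedback": {"last_msg_ts": None, "processed": 0, "errors": 0},
        },
    },
}
summary = summarize_health(sample)
print(summary)
```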
 ## Extraction criteria
 Extract to its own host when:
 - Auth session management needs a dedicated Redis/PG session store, **or**
 - Background sync load (Todoist, future connectors) displaces API serving on the shared host, **or**
 - Team boundary emerges between auth/BFF and recommender orchestration.
--- a/services/api/package.json
+++ b/services/api/package.json
@@ -16,6 +16,7 @@
  },
  "dependencies": {
    "@oo/shared-types": "workspace:*",
    "@sentry/node": "^10.50.0",
    "better-sqlite3": "^11.8.1",
    "cookie-parser": "^1.4.7",
    "cors": "^2.8.5",
@@ -27,6 +28,8 @@
    "nats": "^2.29.3",
    "node-fetch": "^3.3.2",
    "openid-client": "^6.3.4",
    "pino": "^10.3.1",
    "pino-http": "^11.0.0",
    "web-push": "^3.6.7",
    "zod": "^3.24.1"
  },
--- a/services/api/src/config.ts
+++ b/services/api/src/config.ts
@@ -34,6 +34,17 @@ export const config = {
  ML_SERVING_URL: optional('ML_SERVING_URL', 'http://localhost:8000'),
  LITELLM_URL: optional('LITELLM_URL', 'http://localhost:4000'),
  MLFLOW_URL: optional('MLFLOW_URL', 'http://localhost:5000'),
  AIRFLOW_URL: optional('AIRFLOW_URL', 'http://localhost:8080'),
  AIRFLOW_API_USER: optional('AIRFLOW_API_USER', 'admin'),
  AIRFLOW_API_PASSWORD: optional('AIRFLOW_API_PASSWORD', 'admin'),
  /** Shared secret for internal Airflow→API callbacks. */
  INTERNAL_API_TOKEN: optional('INTERNAL_API_TOKEN', ''),
  /** Static token for automated/service access to the admin panel (e.g. Playwright tests). */
  ADMIN_TOKEN: optional('ADMIN_TOKEN', ''),
  VAPID_PUBLIC_KEY: optional('VAPID_PUBLIC_KEY', ''),
  VAPID_PRIVATE_KEY: optional('VAPID_PRIVATE_KEY', ''),
  VAPID_SUBJECT: optional('VAPID_SUBJECT', 'mailto:admin@localhost'),
--- a/services/api/src/db/index.ts
+++ b/services/api/src/db/index.ts
@@ -156,6 +156,10 @@ export function runMigrations() {
    `ALTER TABLE tip_scores ADD COLUMN prompt_version TEXT`,
    `ALTER TABLE tip_scores ADD COLUMN llm_model TEXT`,
    `ALTER TABLE tip_scores ADD COLUMN tip_kind TEXT`,
    `ALTER TABLE sim_runs ADD COLUMN airflow_dag_run_id TEXT`,
    `ALTER TABLE sim_runs ADD COLUMN mlflow_run_id TEXT`,
    `ALTER TABLE sim_runs ADD COLUMN judge_mode TEXT NOT NULL DEFAULT 'rule'`,
    `ALTER TABLE sim_runs ADD COLUMN n_policies INTEGER NOT NULL DEFAULT 2`,
  ]) {
    try { sqlite.exec(stmt); } catch { /* column already exists */ }
  }
--- a/services/api/src/db/schema.ts
+++ b/services/api/src/db/schema.ts
@@ -112,9 +112,13 @@ export const simRuns = sqliteTable('sim_runs', {
  tasksPerRound: integer('tasks_per_round').notNull().default(8),
  useLlm: integer('use_llm', { mode: 'boolean' }).notNull().default(false),
  status: text('status').notNull().default('pending'),  // 'pending'|'running'|'done'|'failed'
  judgeMode: text('judge_mode').notNull().default('rule'),
  nPolicies: integer('n_policies').notNull().default(2),
  summaryJson: text('summary_json'),           // JSON: { [policy]: PolicySummary }
  winner: text('winner'),
  personaBreakdownJson: text('persona_breakdown_json'), // JSON: { [persona]: { [policy]: {reward,n} } }
  airflowDagRunId: text('airflow_dag_run_id'),
  mlflowRunId: text('mlflow_run_id'),
  createdAt: text('created_at').notNull(),
  finishedAt: text('finished_at'),
 });
--- a/services/api/src/events/tests/bus.test.ts
+++ b/services/api/src/events/tests/bus.test.ts
@@ -56,7 +56,7 @@ describe('EventBus — delivery', () => {
  it('does not throw when publishing with no subscribers', () => {
    const b = makeBus();
    expect(() =>
-      b.publish('signals.task.synced', { userId: 'u', count: 3, syncedAt: '' }),
+      b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 3, syncedAt: '' }),
    ).not.toThrow();
  });
@@ -101,7 +101,7 @@ describe('EventBus — ring buffer / tail()', () => {
  it('tail() filters by subject prefix', () => {
    const b = makeBus();
    b.publish('signals.tip.served', { userId: 'u', tipId: 't', policy: 'p', servedAt: '' });
-    b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' });
+    b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
    const tipEvents = b.tail({ subject: 'signals.tip' });
    expect(tipEvents.every((e) => e.subject.startsWith('signals.tip'))).toBe(true);
@@ -178,7 +178,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
    const hook = vi.fn();
    b.onPublish(hook);
-    const payload = { userId: 'u', count: 2, syncedAt: 'now' };
+    const payload = { userId: 'u', source: 'todoist', count: 2, syncedAt: 'now' };
    b.publish('signals.task.synced', payload);
    expect(hook).toHaveBeenCalledOnce();
@@ -191,7 +191,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
    b.onPublish(() => calls.push('a'));
    b.onPublish(() => calls.push('b'));
-    b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' });
+    b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
    expect(calls).toEqual(['a', 'b']);
  });
@@ -202,7 +202,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
    b.onPublish(hook);
    b.subscribe('signals.task.synced', sub);
-    b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' });
+    b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
    expect(hook).toHaveBeenCalledOnce();
    expect(sub).toHaveBeenCalledOnce();
  });
@@ -215,7 +215,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
      throw new Error('boom');
    });
    expect(() =>
-      b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }),
+      b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
    ).toThrow('boom');
  });
 });
--- a/services/api/src/events/tests/nats.test.ts
+++ b/services/api/src/events/tests/nats.test.ts
@@ -106,7 +106,7 @@ describe('connectNats — bridge bus → JetStream', () => {
    await connectNats('nats://test:4222');
-    const payload = { userId: 'u1', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
+    const payload = { userId: 'u1', source: 'todoist', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
    bus.publish('signals.task.synced', payload);
    // Allow the queued microtask in the hook to flush.
@@ -121,16 +121,17 @@ describe('connectNats — bridge bus → JetStream', () => {
  it('swallows JetStream publish errors so the in-process bus keeps working', async () => {
    const { connectNats } = await import('../nats.js');
    const { logger } = await import('../../logger.js');
    const { bus } = await import('../bus.js');
    await connectNats('nats://test:4222');
    // Force the next js.publish to reject.
    lastJsPublish.mockRejectedValueOnce(new Error('jetstream down'));
-    const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
+    const errSpy = vi.spyOn(logger, 'error');
    expect(() =>
-      bus.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }),
+      bus.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
    ).not.toThrow();
    // Wait a tick for the rejected promise's catch to run.
@@ -142,12 +143,16 @@ describe('connectNats — bridge bus → JetStream', () => {
 describe('connectNats — failure mode', () => {
  it('logs a warning and stays silent when connect rejects', async () => {
    const { connectNats } = await import('../nats.js');
    const { logger } = await import('../../logger.js');
    lastConnect.mockRejectedValueOnce(new Error('ECONNREFUSED'));
-    const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});
+    const warnSpy = vi.spyOn(logger, 'warn');
    await expect(connectNats('nats://nope:4222')).resolves.toBeUndefined();
-    expect(warnSpy).toHaveBeenCalledWith(expect.stringContaining('connection failed'));
+    expect(warnSpy).toHaveBeenCalledWith(
      expect.objectContaining({ err: expect.anything() }),
      expect.stringContaining('connection failed'),
    );
  });
 });
@@ -156,7 +161,7 @@ describe('Bus.onPublish contract — used by NATS bridge', () => {
    const b = new Bus();
    const hook = vi.fn();
    b.onPublish(hook);
-    b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' });
+    b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
    expect(hook).toHaveBeenCalledOnce();
  });
 });
--- a/services/api/src/events/bus.ts
+++ b/services/api/src/events/bus.ts
@@ -45,6 +45,7 @@ export type RewardDeliveryFailedEvent = {
 export type TaskSyncedEvent = {
  userId: string;
  source: string;   // e.g. 'todoist'
  count: number;
  syncedAt: string;
 };
--- a/services/api/src/events/nats.ts
+++ b/services/api/src/events/nats.ts
@@ -12,6 +12,7 @@
 import type { NatsConnection, JetStreamClient, StreamConfig } from 'nats';
 import { bus } from './bus.js';
 import { logger } from '../logger.js';
 let nc: NatsConnection | null = null;
 let js: JetStreamClient | null = null;
@@ -67,13 +68,13 @@ export async function connectNats(natsUrl: string): Promise<void> {
      if (!js) return;
      const data = new TextEncoder().encode(JSON.stringify(payload));
      js.publish(subject, data).catch((err: Error) =>
-        console.error(`[nats] publish failed for ${subject}: ${err.message}`),
+        logger.error({ err, subject }, 'nats publish failed'),
      );
    });
-    console.log(`[nats] connected to ${natsUrl}, streams: ${STREAMS.map((s) => s.name).join(', ')}`);
+    logger.info({ url: natsUrl, streams: STREAMS.map((s) => s.name) }, 'nats connected');
  } catch (err: any) {
-    console.warn(`[nats] connection failed — running without JetStream: ${err.message}`);
+    logger.warn({ err }, 'nats connection failed — running without JetStream');
  }
 }
--- a/services/api/src/index.ts
+++ b/services/api/src/index.ts
@@ -1,7 +1,10 @@
 import 'dotenv/config';
 import { logger } from './logger.js';
 import express from 'express';
 import { pinoHttp } from 'pino-http';
 import cookieParser from 'cookie-parser';
 import cors from 'cors';
 import { tracingMiddleware } from './middleware/tracing.js';
 import { config } from './config.js';
 import { db, runMigrations } from './db/index.js';
 import { tipScores, tipFeedback } from './db/schema.js';
@@ -12,7 +15,7 @@ import { integrationsRouter } from './routes/integrations.js';
 import { recommenderRouter } from './routes/recommender.js';
 import { userRouter } from './routes/user.js';
 import { pushRouter } from './routes/push.js';
-import { adminRouter } from './routes/admin.js';
+import { adminRouter, adminInternalRouter } from './routes/admin.js';
 import { mkdir } from 'fs/promises';
 import { dirname } from 'path';
 import { requireAuth } from './middleware/session.js';
@@ -26,13 +29,11 @@ import { registerProfileSubscriptions } from './profile/subscriber.js';
 await mkdir(dirname(config.DATABASE_PATH), { recursive: true });
 runMigrations();
 // Keep the API alive on stray async faults (e.g. a single bad admin route)
 // rather than dropping the whole process.
 process.on('unhandledRejection', (reason) => {
-  console.error('[api] unhandledRejection', reason);
+  logger.error({ err: reason }, 'unhandledRejection');
 });
 process.on('uncaughtException', (err) => {
-  console.error('[api] uncaughtException', err);
+  logger.fatal({ err }, 'uncaughtException');
 });
 const app = express();
@@ -43,6 +44,15 @@ app.use(
    credentials: true,
  }),
 );
 app.use(tracingMiddleware);
 app.use(
  pinoHttp({
    logger,
    genReqId: (req) => req.traceId,
    customProps: (req) => ({ traceId: req.traceId }),
    autoLogging: { ignore: (req) => req.url === '/health' },
  }),
 );
 app.use(express.json());
 app.use(cookieParser());
 app.use(sessionMiddleware);
@@ -55,17 +65,15 @@ app.use('/api', recommenderRouter);
 app.use('/api/user', userRouter);
 app.use('/api/push', pushRouter);
 app.use('/api/admin', adminRouter);
 app.use('/api/admin', adminInternalRouter);
 // Proxy ml/serving endpoints through the API (admin-only).
 // Allows admin UI to call /api/ml/stats/:userId, /api/ml/features/:userId
 // without needing direct access to the ml/serving port.
 app.use('/api/ml', requireAuth as any, requireAdmin as any, async (req: Request, res: Response) => {
  const mlUrl = config.ML_SERVING_URL;
  const target = `${mlUrl}${req.path}`;
  try {
    const upstream = await fetch(target, {
      method: req.method,
-      headers: { 'Content-Type': 'application/json' },
+      headers: { 'Content-Type': 'application/json', traceparent: req.traceparent },
      body: req.method !== 'GET' ? JSON.stringify(req.body) : undefined,
      signal: AbortSignal.timeout(5000),
    });
@@ -82,7 +90,7 @@ async function purgeExpiredData() {
    await db.delete(tipScores).where(lt(tipScores.servedAt, cutoff));
    await db.delete(tipFeedback).where(lt(tipFeedback.createdAt, cutoff));
  } catch (err: any) {
-    console.error(`[purge] retention cleanup failed: ${err.message}`);
+    logger.error({ err }, 'retention cleanup failed');
  }
 }
@@ -90,7 +98,7 @@ purgeExpiredData();
 setInterval(purgeExpiredData, 24 * 60 * 60 * 1000);
 app.listen(config.PORT, () => {
-  console.log(`oO API listening on http://localhost:${config.PORT}`);
+  logger.info({ port: config.PORT }, 'oO API listening');
 });
 if (config.NATS_URL) {
--- a/services/api/src/logger.ts
+++ b/services/api/src/logger.ts
@@ -0,0 +1,12 @@
 import pino from 'pino';
 import * as Sentry from '@sentry/node';
 if (process.env['SENTRY_DSN']) {
  Sentry.init({
    dsn: process.env['SENTRY_DSN'],
    environment: process.env['NODE_ENV'] ?? 'development',
  });
 }
 export const logger = pino({ level: process.env['LOG_LEVEL'] ?? 'info' });
 export { Sentry };
--- a/services/api/src/middleware/tracing.ts
+++ b/services/api/src/middleware/tracing.ts
@@ -0,0 +1,26 @@
 import { randomBytes } from 'crypto';
 import type { Request, Response, NextFunction } from 'express';
 declare global {
  namespace Express {
    interface Request {
      traceId: string;
      traceparent: string;
    }
  }
 }
 export function tracingMiddleware(req: Request, _res: Response, next: NextFunction): void {
  const incoming = req.headers['traceparent'] as string | undefined;
  let traceId: string;
  if (incoming) {
    const parts = incoming.split('-');
    traceId = parts.length === 4 && parts[1]?.length === 32 ? parts[1] : randomBytes(16).toString('hex');
  } else {
    traceId = randomBytes(16).toString('hex');
  }
  const parentId = randomBytes(8).toString('hex');
  req.traceId = traceId;
  req.traceparent = `00-${traceId}-${parentId}-01`;
  next();
 }
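Reviewer note on the tracing middleware above: it accepts any 32-character second field as a trace id. The sketch below shows a slightly stricter variant of the same idea, assuming the W3C Trace Context rule that a trace id is 32 lowercase hex characters and not all zeros. The names `extractTraceId` and `nextTraceparent` are hypothetical and do not appear in this change.

```typescript
import { randomBytes } from 'crypto';

// Hypothetical helpers for illustration only; not part of this diff.
const TRACE_ID_RE = /^[0-9a-f]{32}$/;

// Returns the trace id from an incoming traceparent header, or null when the
// header is missing or malformed.
export function extractTraceId(header: string | undefined): string | null {
  if (!header) return null;
  const parts = header.split('-');
  if (parts.length !== 4) return null;
  const traceId = parts[1] ?? '';
  // Assumption: W3C Trace Context format (32 lowercase hex chars, all-zero id is invalid).
  if (!TRACE_ID_RE.test(traceId) || /^0+$/.test(traceId)) return null;
  return traceId;
}

// Reuses a valid incoming trace id, otherwise mints a new one, and always
// mints a fresh 8-byte span id for the outgoing header.
export function nextTraceparent(incoming: string | undefined): { traceId: string; traceparent: string } {
  const traceId = extractTraceId(incoming) ?? randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex');
  return { traceId, traceparent: `00-${traceId}-${spanId}-01` };
}
```

A middleware built on this would set `req.traceId` and `req.traceparent` from the returned object, as `tracingMiddleware` already does.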
--- a/services/api/src/routes/tests/admin.test.ts
+++ b/services/api/src/routes/tests/admin.test.ts
@@ -4,7 +4,7 @@
 * A real Express app + in-memory SQLite DB per test suite.
 * Auth and admin middleware are mocked so we can focus on route logic.
 */
-import { describe, it, expect, vi, beforeAll } from 'vitest';
+import { describe, it, expect, vi, beforeAll, afterEach } from 'vitest';
 import express from 'express';
 import * as http from 'http';
 import { makeTestDb } from '../../test/db.js';
@@ -385,16 +385,126 @@ describe('GET /api/admin/events', () => {
  });
 });
 // ---------------------------------------------------------------------------
 // Health endpoint — mock fetch so tests don't depend on running services.
 // ---------------------------------------------------------------------------
 describe('GET /api/admin/health', () => {
-  it('returns 200 with ok, services array, and checkedAt', async () => {
+  const EXPECTED_HTTP_SERVICES = ['api', 'ml-serving', 'mlflow', 'airflow'] as const;
  const EXPECTED_INTERNAL = ['sqlite', 'event-bus'] as const;
  const VALID_STATUSES = new Set(['ok', 'degraded', 'down']);
  type ServiceRow = { name: string; status: string; latencyMs: number };
  type HealthBody = { ok: boolean; services: ServiceRow[]; checkedAt: string };
  function mockFetch(upServices: Set<string>) {
    // Resolve service name by port (matches defaults in config.ts).
    // Up services return HTTP 200; absent ones throw (simulates connection refused → 'down').
    vi.stubGlobal('fetch', async (url: string) => {
      const s = String(url);
      let name: string;
      if (s.includes(':8000'))      name = 'ml-serving';
      else if (s.includes(':5000')) name = 'mlflow';
      else if (s.includes(':8080')) name = 'airflow';
      else                          name = 'api';
      if (!upServices.has(name)) throw new Error(`ECONNREFUSED ${name}`);
      return { ok: true, json: async () => ({ ok: true, status: 'healthy' }) };
    });
  }
  afterEach(() => vi.unstubAllGlobals());
  it('shape: 200, typed fields, all expected services present', async () => {
    mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
    const { server, call } = await startServer(buildApp());
    try {
      const { status, body } = await call('GET', '/api/admin/health');
-      const b = body as { ok: boolean; services: { name: string; status: string }[]; checkedAt: string };
+      const b = body as HealthBody;
      expect(status).toBe(200);
      expect(typeof b.ok).toBe('boolean');
      expect(Array.isArray(b.services)).toBe(true);
      expect(typeof b.checkedAt).toBe('string');
      expect(new Date(b.checkedAt).getTime()).toBeGreaterThan(0);
      const names = b.services.map((s) => s.name);
      for (const svc of [...EXPECTED_HTTP_SERVICES, ...EXPECTED_INTERNAL]) {
        expect(names).toContain(svc);
      }
      for (const svc of b.services) {
        expect(VALID_STATUSES).toContain(svc.status);
        expect(typeof svc.latencyMs).toBe('number');
      }
    } finally {
      server.close();
    }
  });
  it('ok=true when all HTTP services respond 200', async () => {
    mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
    const { server, call } = await startServer(buildApp());
    try {
      const { body } = await call('GET', '/api/admin/health');
      const b = body as HealthBody;
      for (const name of EXPECTED_HTTP_SERVICES) {
        const svc = b.services.find((s) => s.name === name);
        expect(svc?.status, `${name} should be ok`).toBe('ok');
      }
      expect(b.ok).toBe(true);
    } finally {
      server.close();
    }
  });
  it('ml-serving=down and ok=false when ml-serving is unreachable', async () => {
    mockFetch(new Set(['api', 'mlflow', 'airflow'])); // ml-serving absent
    const { server, call } = await startServer(buildApp());
    try {
      const { body } = await call('GET', '/api/admin/health');
      const b = body as HealthBody;
      const mlSvc = b.services.find((s) => s.name === 'ml-serving');
      expect(mlSvc?.status).toBe('down');
      expect(b.ok).toBe(false);
    } finally {
      server.close();
    }
  });
  it('airflow=down and ok=false when airflow is unreachable', async () => {
    mockFetch(new Set(['api', 'ml-serving', 'mlflow'])); // airflow absent
    const { server, call } = await startServer(buildApp());
    try {
      const { body } = await call('GET', '/api/admin/health');
      const b = body as HealthBody;
      const svc = b.services.find((s) => s.name === 'airflow');
      expect(svc?.status).toBe('down');
      expect(b.ok).toBe(false);
    } finally {
      server.close();
    }
  });
  it('mlflow=down and ok=false when mlflow is unreachable', async () => {
    mockFetch(new Set(['api', 'ml-serving', 'airflow'])); // mlflow absent
    const { server, call } = await startServer(buildApp());
    try {
      const { body } = await call('GET', '/api/admin/health');
      const b = body as HealthBody;
      const svc = b.services.find((s) => s.name === 'mlflow');
      expect(svc?.status).toBe('down');
      expect(b.ok).toBe(false);
    } finally {
      server.close();
    }
  });
  it('sqlite and event-bus are always present regardless of HTTP service status', async () => {
    mockFetch(new Set()); // all HTTP services down
    const { server, call } = await startServer(buildApp());
    try {
      const { body } = await call('GET', '/api/admin/health');
      const b = body as HealthBody;
      expect(b.services.find((s) => s.name === 'sqlite')?.status).toBe('ok');
      expect(b.services.find((s) => s.name === 'event-bus')?.status).toBe('ok');
    } finally {
      server.close();
    }
--- a/services/api/src/routes/admin.ts
+++ b/services/api/src/routes/admin.ts
@@ -1,4 +1,5 @@
-import { type Router as ExpressRouter, Router, Response } from 'express';
+import { type Router as ExpressRouter, Router, Response, type Request } from 'express';
 import { logger } from '../logger.js';
 import { db, rawSqlite } from '../db/index.js';
 import {
  users,
@@ -523,16 +524,24 @@ router.get('/data-quality', async (req: AuthenticatedRequest, res: Response) =>
 // Fan-out to all subsystem /health endpoints.
 // ---------------------------------------------------------------------------
 router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
-  const checks: Array<{ name: string; url: string }> = [
+  const airflowAuth = Buffer.from(`${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`).toString('base64');
-    { name: 'api', url: `http://localhost:${process.env.PORT ?? 3001}/health` },
+
  const checks: Array<{ name: string; url: string; headers?: Record<string, string> }> = [
    { name: 'api',        url: `http://localhost:${config.PORT}/health` },
    { name: 'ml-serving', url: `${config.ML_SERVING_URL}/health` },
    { name: 'mlflow',     url: `${config.MLFLOW_URL}/health` },
    { name: 'airflow',    url: `${config.AIRFLOW_URL}/api/v1/health`,
      headers: { Authorization: `Basic ${airflowAuth}` } },
  ];
  const results = await Promise.allSettled(
-    checks.map(async ({ name, url }) => {
+    checks.map(async ({ name, url, headers }) => {
      const t0 = Date.now();
      try {
-        const r = await fetch(url, { signal: AbortSignal.timeout(3000) });
+        const r = await fetch(url, {
          headers,
          signal: AbortSignal.timeout(3000),
        });
        return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
      } catch {
        return { name, status: 'down', latencyMs: Date.now() - t0 };
@@ -548,15 +557,12 @@ router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
    dbStatus = 'down';
  }
  // Event bus: always ok if process is alive
  const eventBusStatus = 'ok';
  const services = results.map((r) =>
    r.status === 'fulfilled' ? r.value : { name: 'unknown', status: 'down', latencyMs: 0 },
  );
  services.push({ name: 'sqlite',    status: dbStatus, latencyMs: 0 });
-  services.push({ name: 'event-bus', status: eventBusStatus, latencyMs: 0 });
+  services.push({ name: 'event-bus', status: 'ok',     latencyMs: 0 });
  const allOk = services.every((s) => s.status === 'ok');
  res.json({ ok: allOk, services, checkedAt: new Date().toISOString() });
@@ -699,22 +705,21 @@ router.delete('/saved-queries/:id', async (req: AuthenticatedRequest, res: Respo
 // ---------------------------------------------------------------------------
 // POST /api/admin/simulate/start
-// Spawn ml/experiments/sim/runner.py in the background; return run_id.
+// Trigger an Airflow DAG run (bandit_sim). Falls back to a local subprocess
 // when AIRFLOW_URL is not reachable, so local dev still works.
 // ---------------------------------------------------------------------------
 router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response) => {
  const {
    nUsers = 5,
    nRounds = 20,
    tasksPerRound = 8,
-    useLlm = false,
    judgeMode = 'rule',
    policies = ['linucb-v1', 'egreedy-v1'],
  } = req.body as {
    nUsers?: number;
    nRounds?: number;
    tasksPerRound?: number;
-    useLlm?: boolean;
-    judgeMode?: 'rule' | 'llm' | 'claude-code';
+    judgeMode?: 'rule' | 'llm';
    policies?: string[];
  };
@@ -733,17 +738,69 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
    nUsers,
    nRounds,
    tasksPerRound,
-    useLlm,
+    useLlm: judgeMode === 'llm',
    judgeMode,
    nPolicies: policies.length,
    status: 'running',
    createdAt: now,
  });
  // ── Try Airflow first ────────────────────────────────────────────────────
  if (config.AIRFLOW_URL && config.INTERNAL_API_TOKEN) {
    try {
      const airflowAuth = Buffer.from(
        `${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`,
      ).toString('base64');
      const dagRes = await fetch(
        `${config.AIRFLOW_URL}/api/v1/dags/bandit_sim/dagRuns`,
        {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            Authorization: `Basic ${airflowAuth}`,
          },
          body: JSON.stringify({
            conf: {
              sim_run_id: id,
              n_users: nUsers,
              n_rounds: nRounds,
              tasks_per_round: tasksPerRound,
              policies,
              judge_mode: judgeMode,
              ml_url: config.ML_SERVING_URL,
              mlflow_url: config.MLFLOW_URL,
              callback_url: `${config.API_BASE_URL}/api/admin/simulate/${id}/complete`,
              internal_token: config.INTERNAL_API_TOKEN,
            },
          }),
          signal: AbortSignal.timeout(5000),
        },
      );
      if (dagRes.ok) {
        const dagBody = await dagRes.json() as { dag_run_id: string };
        await db
          .update(simRuns)
          .set({ airflowDagRunId: dagBody.dag_run_id })
          .where(eq(simRuns.id, id));
        res.json({ id, status: 'running', airflow_dag_run_id: dagBody.dag_run_id });
        return;
      }
      logger.warn({ status: dagRes.status }, 'sim: Airflow trigger failed, falling back to subprocess');
    } catch (err) {
      logger.warn({ err }, 'sim: Airflow unreachable, falling back to subprocess');
    }
  }
  // ── Subprocess fallback (local dev / Airflow not configured) ────────────
  const runnerPath = resolve(__dirname, '../../../../ml/experiments/sim/runner.py');
  const venvPython = resolve(__dirname, '../../../../ml/serving/.venv/bin/python');
  const pythonBin = existsSync(venvPython) ? venvPython : 'python3';
  const outPath = `/tmp/oo-sim-${id}.json`;
-  const args = [
+  const child = spawn(pythonBin, [
    runnerPath,
    '--n-users', String(nUsers),
    '--n-rounds', String(nRounds),
@@ -751,32 +808,22 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
    '--ml-url', config.ML_SERVING_URL,
    '--policies', ...policies,
    '--out', outPath,
-    '--judge', judgeMode === 'llm' ? 'llm' : judgeMode === 'claude-code' ? 'rule' : 'rule',
+    '--judge', judgeMode,
-    // claude-code mode isn't auto-runnable from the API (requires human in the loop)
+    '--mlflow-url', config.MLFLOW_URL,
-    // it falls back to rule judge when triggered from the panel
+    '--mlflow-experiment', 'bandit_simulation',
-  ];
+  ], { stdio: ['ignore', 'pipe', 'pipe'] });
-  const child = spawn(pythonBin, args, { stdio: ['ignore', 'pipe', 'pipe'] });
-  if (child.pid) {
-    _simProcesses.set(id, { pid: child.pid, startedAt: now });
-  }
+  if (child.pid) _simProcesses.set(id, { pid: child.pid, startedAt: now });
  // Without this listener, a spawn failure (ENOENT when python3 is absent
  // — e.g. in the alpine api container) would emit an unhandled 'error' event
  // and crash the whole API process.
  child.on('error', async (err) => {
-    console.error('[sim] spawn error', err);
+    logger.error({ err }, 'sim: spawn error');
    _simProcesses.delete(id);
-    await db
-      .update(simRuns)
+    await db.update(simRuns)
      .set({ status: 'failed', finishedAt: new Date().toISOString() })
      .where(eq(simRuns.id, id));
  });
-  // Capture stderr for debugging
+  child.stderr?.on('data', (d: Buffer) => logger.debug({ stderr: d.toString() }, 'sim stderr'));
  const stderrLines: string[] = [];
  child.stderr?.on('data', (d: Buffer) => stderrLines.push(d.toString()));
  child.on('exit', async (code) => {
    _simProcesses.delete(id);
@@ -785,8 +832,6 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
    if (code === 0 && existsSync(outPath)) {
      try {
        const raw = JSON.parse(readFileSync(outPath, 'utf-8'));
        // Bulk-insert sim events
        const eventRows = (raw.events ?? []).map((ev: Record<string, unknown>) => ({
          id: nanoid(),
          runId: id,
@@ -804,21 +849,19 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
          dayOfWeek: Number(ev.day_of_week),
          createdAt: now,
        }));
        for (const row of eventRows) {
          await db.insert(simEvents).values(row).catch(() => {});
        }
        await db.update(simRuns).set({
          status: 'done',
          summaryJson: JSON.stringify(raw.summary),
          winner: raw.winner,
          personaBreakdownJson: JSON.stringify(raw.persona_breakdown),
          mlflowRunId: raw.mlflow_run_id ?? null,
          finishedAt,
        }).where(eq(simRuns.id, id));
        try { unlinkSync(outPath); } catch { /* ignore */ }
-      } catch (e) {
+      } catch {
        await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
      }
    } else {
@@ -863,4 +906,68 @@ router.get('/simulate/:id', async (req: AuthenticatedRequest, res: Response) =>
  res.json({ run: { ...run, isRunning }, events });
 });
-export { router as adminRouter };
+// ---------------------------------------------------------------------------
 // internalRouter — no session auth; only INTERNAL_API_TOKEN header check.
 // Mounted separately in index.ts at /api/admin to avoid router.use() auth.
 // ---------------------------------------------------------------------------
 const internalRouter: ExpressRouter = Router();
 internalRouter.post('/simulate/:id/complete', async (req: Request, res: Response) => {
  const token = req.headers['x-internal-token'];
  if (!config.INTERNAL_API_TOKEN || token !== config.INTERNAL_API_TOKEN) {
    res.status(401).json({ error: 'Unauthorized' });
    return;
  }
  const { id } = req.params as { id: string };
  const { summary, winner, persona_breakdown, events: rawEvents, mlflow_run_id } =
    req.body as {
      summary: Record<string, unknown>;
      winner: string;
      persona_breakdown: Record<string, unknown>;
      events: Record<string, unknown>[];
      mlflow_run_id?: string;
    };
  const finishedAt = new Date().toISOString();
  const now = finishedAt;
  try {
    const eventRows = (rawEvents ?? []).map((ev) => ({
      id: nanoid(),
      runId: id,
      round: Number(ev['round']),
      userId: String(ev['user_id']),
      persona: String(ev['persona']),
      policy: String(ev['policy']),
      tipContent: String(ev['tip_content']),
      priority: Number(ev['priority']),
      isOverdue: Boolean(ev['is_overdue']),
      action: String(ev['action']),
      dwellMs: ev['dwell_ms'] != null ? Number(ev['dwell_ms']) : null,
      rewardMilli: Math.round(Number(ev['reward']) * 1000),
      hour: Number(ev['hour']),
      dayOfWeek: Number(ev['day_of_week']),
      createdAt: now,
    }));
    for (const row of eventRows) {
      await db.insert(simEvents).values(row).catch(() => {});
    }
    await db.update(simRuns).set({
      status: 'done',
      summaryJson: JSON.stringify(summary),
      winner,
      personaBreakdownJson: JSON.stringify(persona_breakdown),
      mlflowRunId: mlflow_run_id ?? null,
      finishedAt,
    }).where(eq(simRuns.id, id));
    res.json({ ok: true });
  } catch (err) {
    logger.error({ err }, 'sim: complete callback failed');
    await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
    res.status(500).json({ error: 'Failed to store results' });
  }
 });
 export { router as adminRouter, internalRouter as adminInternalRouter };
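Reviewer note on the `/health` fan-out in this file: the status rules (HTTP 2xx is `ok`, any other response is `degraded`, a thrown error or timeout is `down`, and the top-level `ok` requires every service to be `ok`) can be isolated in a small function that takes the probe as a parameter. This is a minimal sketch under that reading of the route. `runChecks` and `Probe` are hypothetical names, and the route itself calls `fetch` directly.

```typescript
type Status = 'ok' | 'degraded' | 'down';
type ServiceRow = { name: string; status: Status; latencyMs: number };
type Check = { name: string; url: string; headers?: Record<string, string> };

// Minimal fetch-like signature so a stub can replace the network in tests.
type Probe = (
  url: string,
  init: { headers?: Record<string, string>; signal: AbortSignal },
) => Promise<{ ok: boolean }>;

// Hypothetical helper: probes every check concurrently and never rejects,
// because each failure is mapped to a 'down' row instead of propagating.
export async function runChecks(
  checks: Check[],
  probe: Probe,
  timeoutMs = 3000,
): Promise<{ ok: boolean; services: ServiceRow[] }> {
  const services = await Promise.all(
    checks.map(async ({ name, url, headers }): Promise<ServiceRow> => {
      const t0 = Date.now();
      try {
        const r = await probe(url, { headers, signal: AbortSignal.timeout(timeoutMs) });
        return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
      } catch {
        return { name, status: 'down', latencyMs: Date.now() - t0 };
      }
    }),
  );
  return { ok: services.every((s) => s.status === 'ok'), services };
}
```

Passing the probe in keeps the tests free of `vi.stubGlobal('fetch', ...)`; whether that trade is worth the extra indirection here is a judgment call.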
--- a/services/api/src/routes/auth.ts
+++ b/services/api/src/routes/auth.ts
@@ -5,6 +5,7 @@ import { db } from '../db/index.js';
 import { users, sessions } from '../db/schema.js';
 import { eq } from 'drizzle-orm';
 import { config } from '../config.js';
 import { logger } from '../logger.js';
 const router: ExpressRouter = Router();
@@ -36,7 +37,7 @@ router.get('/login', async (req: Request, res: Response) => {
  setTimeout(() => pendingStates.delete(state), 10 * 60 * 1000);
  const redirectUri = `${config.API_BASE_URL}/api/auth/callback`;
-  console.log('[auth] redirect_uri sent to Google:', redirectUri);
+  logger.info({ redirectUri }, 'auth: redirect_uri');
  const authUrl = client.buildAuthorizationUrl(cfg, {
    redirect_uri: redirectUri,
    scope: 'openid email profile',
@@ -72,7 +73,7 @@ router.get('/callback', async (req: Request, res: Response) => {
      expectedState: state,
    });
  } catch (err) {
-    console.error('OAuth callback error', err);
+    logger.error({ err }, 'auth: OAuth callback error');
    res.status(400).json({ error: 'OAuth error' });
    return;
  }
@@ -123,6 +124,45 @@ router.get('/callback', async (req: Request, res: Response) => {
    .redirect(`${config.WEB_BASE_URL}${pending.redirectTo}`);
 });
 /**
 * POST /api/auth/token
 * Exchange the static ADMIN_TOKEN for a session cookie.
 * Finds the first admin user in the DB; rejects if ADMIN_TOKEN is not configured.
 */
 router.post('/token', async (req: Request, res: Response) => {
  const { token } = req.body as { token?: string };
  if (!config.ADMIN_TOKEN || !token || token !== config.ADMIN_TOKEN) {
    res.status(401).json({ error: 'Invalid token' });
    return;
  }
  const [adminUser] = await db
    .select()
    .from(users)
    .where(eq(users.role, 'admin'))
    .limit(1);
  if (!adminUser) {
    res.status(403).json({ error: 'No admin user exists' });
    return;
  }
  const sid = nanoid(32);
  const now = new Date().toISOString();
  const expiresAt = new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString();
  await db.insert(sessions).values({ id: sid, userId: adminUser.id, expiresAt, createdAt: now });
  res
    .cookie('sid', sid, {
      httpOnly: true,
      secure: config.NODE_ENV === 'production',
      sameSite: 'lax',
      expires: new Date(expiresAt),
      path: '/',
    })
    .json({ ok: true });
 });
 /** POST /api/auth/logout */
 router.post('/logout', async (req: Request, res: Response) => {
  const sid = req.cookies?.sid as string | undefined;
--- a/services/api/src/routes/recommender.ts
+++ b/services/api/src/routes/recommender.ts
@@ -1,5 +1,6 @@
 import { type Router as ExpressRouter, Router, Response } from 'express';
 import { nanoid } from 'nanoid';
 import { logger } from '../logger.js';
 import { db } from '../db/index.js';
 import { integrationTokens, tipFeedback, tipViews, tipScores } from '../db/schema.js';
 import { eq, and, desc } from 'drizzle-orm';
@@ -47,7 +48,8 @@ export const _clearCandidateCacheForTests = () => {
 // Shadow-policy registry
 // ---------------------------------------------------------------------------
 const shadowPolicies = new Map<string, { active: boolean }>([
-  // egreedy-v2 (D=12, profile features) — disabled until sim gate per ADR-0012
+  // egreedy-v2 promoted to active policy (ADR-0012). Shadow entry kept for
  // rollback toggle; leave disabled in normal operation.
  ['egreedy-v2-shadow', { active: false }],
 ]);
@@ -84,6 +86,7 @@ async function remotePolicy(
  userId: string,
  tasks: TipCandidate[],
  profile: Profile,
  traceparent?: string,
 ): Promise<{ tipId: string; score: number; policy: string } | null> {
  const hour = new Date().getHours();
  const dayOfWeek = new Date().getDay();
@@ -101,17 +104,16 @@ async function remotePolicy(
    profile_features: profile,
  };
-  // Active policy: egreedy-v1 (selected over linucb-v1 after offline sim — ADR-0007)
  try {
-    const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy`, {
+    const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy/v2`, {
      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
+      headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
      body: JSON.stringify(body),
      signal: AbortSignal.timeout(3000),
    });
    if (!res.ok) return null;
    const data = (await res.json()) as { tip_id: string; score: number };
-    return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v1' };
+    return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v2' };
  } catch {
    return null;
  }
@@ -145,6 +147,7 @@ async function fetchLlmCandidates(
  dayOfWeek: number,
  promptVersion: string | null,
  profile: Profile,
  traceparent?: string,
 ): Promise<LlmGenerateResult> {
  try {
    const tasks = signals.slice(0, 10).map((s) => ({
@@ -155,7 +158,7 @@ async function fetchLlmCandidates(
    }));
    const res = await fetch(`${config.ML_SERVING_URL}/generate`, {
      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
+      headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
      body: JSON.stringify({
        user_id: userId,
        context: { tasks, hour_of_day: hour, day_of_week: dayOfWeek },
@@ -225,6 +228,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
    dayOfWeek,
    requestedPromptVersion,
    profile,
    req.traceparent,
  );
  const allCandidates: TipCandidate[] = [...signalCandidates, ...llmResult.candidates];
@@ -239,7 +243,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
  const t0 = Date.now();
  // Stage 2: score — egreedy bandit with random fallback
-  const scored = await remotePolicy(req.userId!, allCandidates, profile);
+  const scored = await remotePolicy(req.userId!, allCandidates, profile, req.traceparent);
  const latencyMs = Date.now() - t0;
  const tip = scored
    ? (allCandidates.find((t) => t.id === scored.tipId) ?? randomPolicy(allCandidates))
@@ -371,6 +375,8 @@ async function sendRewardWithRetry(
  tipId: string,
  reward: number,
  features: TipCandidate['features'],
  profile: Profile,
  traceparent?: string,
 ): Promise<void> {
  const body = JSON.stringify({
    user_id: userId,
@@ -378,13 +384,14 @@ async function sendRewardWithRetry(
    reward,
    features,
    day_of_week: new Date().getDay(),
    profile_features: profile,
  });
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
-      const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy`, {
+      const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy/v2`, {
        method: 'POST',
-        headers: { 'Content-Type': 'application/json' },
+        headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
        body,
        signal: AbortSignal.timeout(3000),
      });
@@ -392,7 +399,7 @@ async function sendRewardWithRetry(
      throw new Error(`HTTP ${res.status}`);
    } catch (err: any) {
      if (attempt === 3) {
-        console.error(`[reward] failed after 3 attempts for tip ${tipId}: ${err.message}`);
+        logger.error({ tipId, err }, 'reward: failed after 3 attempts');
        bus.publish('signals.tip.reward_failed', {
          userId,
          tipId,
@@ -463,7 +470,9 @@ router.post('/tip/:id/feedback', requireAuth, async (req: AuthenticatedRequest,
  });
  if (candidate) {
-    sendRewardWithRetry(req.userId!, tipId, reward, candidate.features);
+    // Re-fetch profile for the v2 ridge update; TTL cache makes this near-instant.
    const profile = await getProfile(req.userId!);
    sendRewardWithRetry(req.userId!, tipId, reward, candidate.features, profile, req.traceparent);
  }
  // Delegate action to the owning signal source (e.g. mark done in Todoist)
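Reviewer note on the recommender changes: the conditional `traceparent` spread is repeated in three `fetch` calls, and `sendRewardWithRetry` hand-rolls a three-attempt loop. The sketch below shows how both patterns could be factored out. `jsonHeaders` and `withRetry` are hypothetical names, and the retry has no delay between attempts, which is an assumption, since the full loop body is not visible in this diff.

```typescript
type HeaderMap = Record<string, string>;

// Builds JSON request headers, adding traceparent only when one is available,
// mirroring the `...(traceparent ? { traceparent } : {})` spread in the diff.
export function jsonHeaders(traceparent?: string): HeaderMap {
  return { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) };
}

// Hypothetical helper: runs fn up to `attempts` times and rethrows the last
// error if every attempt fails. No backoff between attempts (assumption).
export async function withRetry<T>(
  fn: (attempt: number) => Promise<T>,
  attempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastErr = err;
    }
  }
  throw lastErr;
}
```

With these, the reward call would wrap its `fetch` in `withRetry` and log plus publish `signals.tip.reward_failed` in a single `catch` around it.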
--- a/services/api/src/signals/tests/scheduler.test.ts
+++ b/services/api/src/signals/tests/scheduler.test.ts
@@ -8,6 +8,11 @@
 */
 import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
 vi.mock('../../logger.js', () => ({
  logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), fatal: vi.fn() },
 }));
 import { logger } from '../../logger.js';
 // ── mock the drizzle query chain: db.select(...).from(...).where(...) ────────
 let users: { userId: string }[] = [];
 const whereMock = vi.fn(async () => users);
@@ -35,6 +40,7 @@ beforeEach(() => {
  whereMock.mockClear();
  fromMock.mockClear();
  selectMock.mockClear();
  vi.clearAllMocks();
  vi.useFakeTimers();
 });
@@ -102,8 +108,6 @@ describe('startTodoistSyncScheduler', () => {
      if (id === 'bad') throw new Error('todoist 401');
      return [];
    });
-    const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
-    const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {});
    startTodoistSyncScheduler(60_000);
    await vi.advanceTimersByTimeAsync(10_001);
@@ -112,19 +116,27 @@ describe('startTodoistSyncScheduler', () => {
    await Promise.resolve();
    expect(fetchSignalsMock).toHaveBeenCalledTimes(3);
-    expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('sync error'), expect.anything());
-    expect(logSpy).toHaveBeenCalledWith(expect.stringContaining('2 ok, 1 failed'));
+    expect(logger.error).toHaveBeenCalledWith(
+      expect.objectContaining({ err: expect.anything() }),
+      'scheduler: sync error',
+    );
+    expect(logger.info).toHaveBeenCalledWith(
+      expect.objectContaining({ ok: 2, failed: 1 }),
+      'scheduler: todoist sync',
+    );
  });
  it('survives a db query failure — logs and skips the tick', async () => {
    const { startTodoistSyncScheduler } = await import('../scheduler.js');
    whereMock.mockRejectedValueOnce(new Error('sqlite locked'));
-    const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
    startTodoistSyncScheduler(60_000);
    await vi.advanceTimersByTimeAsync(10_001);
    expect(fetchSignalsMock).not.toHaveBeenCalled();
-    expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('failed to query users'));
+    expect(logger.error).toHaveBeenCalledWith(
+      expect.objectContaining({ err: expect.anything() }),
+      'scheduler: failed to query users',
+    );
  });
 });
--- a/services/api/src/signals/aggregator.ts
+++ b/services/api/src/signals/aggregator.ts
@@ -1,4 +1,5 @@
 import type { Signal, SignalSource } from '@oo/shared-types';
+import { logger } from '../logger.js';
 /**
 * Merges signals from all registered sources for a user.
@@ -24,7 +25,7 @@ export class SignalAggregator {
      if (r.status === 'fulfilled') {
        signals.push(...r.value);
      } else {
-        console.error(`[aggregator] source '${this.sources[i].id}' failed:`, r.reason);
+        logger.error({ sourceId: this.sources[i]!.id, err: r.reason }, 'aggregator: source failed');
      }
    }
    return signals;
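The failure isolation that this hunk logs can be illustrated with `Promise.allSettled`. The following is a minimal sketch under stated assumptions: the `Signal` and `SignalSource` shapes are simplified stand-ins for the types in `@oo/shared-types`, and `fetchAll` with an `onError` callback is a hypothetical free function rather than the `SignalAggregator` method itself.

```typescript
// Simplified stand-ins (assumptions); the real types live in @oo/shared-types.
interface Signal {
  id: string;
  source: string;
}
interface SignalSource {
  readonly id: string;
  fetchSignals(userId: string): Promise<Signal[]>;
}

// Fan out to every source in parallel; one failing source never blocks the rest.
async function fetchAll(
  sources: SignalSource[],
  userId: string,
  onError: (sourceId: string, err: unknown) => void,
): Promise<Signal[]> {
  // The async wrapper turns a synchronous throw into a rejection that allSettled captures.
  const results = await Promise.allSettled(sources.map(async (s) => s.fetchSignals(userId)));
  const signals: Signal[] = [];
  results.forEach((r, i) => {
    if (r.status === 'fulfilled') {
      signals.push(...r.value);
    } else {
      onError(sources[i]!.id, r.reason); // e.g. a structured logger.error call
    }
  });
  return signals;
}
```

Passing the error handler in keeps the sketch free of a logger dependency; in the diff the same slot is filled by the structured logger call.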
--- a/services/api/src/signals/scheduler.ts
+++ b/services/api/src/signals/scheduler.ts
@@ -13,6 +13,7 @@ import { db } from '../db/index.js';
 import { integrationTokens } from '../db/schema.js';
 import { eq } from 'drizzle-orm';
 import { todoistSource } from './todoist.js';
+import { logger } from '../logger.js';
 const DEFAULT_INTERVAL_MS = 15 * 60 * 1000;
@@ -25,7 +26,7 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
        .from(integrationTokens)
        .where(eq(integrationTokens.tokenStatus, 'active'));
    } catch (err: any) {
-      console.error(`[scheduler] failed to query users: ${err.message}`);
+      logger.error({ err }, 'scheduler: failed to query users');
      return;
    }
@@ -39,10 +40,10 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
    let failed = 0;
    for (const r of results) {
      if (r.status === 'fulfilled') ok++;
-      else { failed++; console.error(`[scheduler] sync error:`, r.reason); }
+      else { failed++; logger.error({ err: r.reason }, 'scheduler: sync error'); }
    }
-    console.log(`[scheduler] todoist sync: ${ok} ok, ${failed} failed (${users.length} users)`);
+    logger.info({ ok, failed, total: users.length }, 'scheduler: todoist sync');
  }
  // Run once shortly after startup, then on interval
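The "run once shortly after startup, then on interval" pattern noted in the last context line can be sketched as follows. This is not the body of `startTodoistSyncScheduler`: `startScheduler`, its `onError` parameter, and the returned stop function are hypothetical, and the 10 s startup delay is only inferred from the test advancing fake timers by 10_001 ms.

```typescript
// Illustrative sketch: startScheduler and the default startup delay are assumptions.
function startScheduler(
  tick: () => Promise<void>,
  intervalMs: number,
  startupDelayMs = 10_000,
  onError: (err: unknown) => void = () => {},
): () => void {
  // A rejected tick is reported but never escapes as an unhandled rejection.
  const safeTick = (): void => {
    tick().catch(onError);
  };
  const first = setTimeout(safeTick, startupDelayMs); // one early run after boot
  const timer = setInterval(safeTick, intervalMs);    // then the steady cadence
  // The caller invokes the returned function to stop both timers.
  return () => {
    clearTimeout(first);
    clearInterval(timer);
  };
}
```

Catching inside `safeTick` is what lets a scheduler survive a failed database query or a failing user sync, which is the behavior the tests above assert.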
--- a/services/api/src/signals/todoist.ts
+++ b/services/api/src/signals/todoist.ts
@@ -3,6 +3,7 @@ import { db } from '../db/index.js';
 import { integrationTokens } from '../db/schema.js';
 import { eq, and } from 'drizzle-orm';
 import { bus } from '../events/bus.js';
+import { logger } from '../logger.js';
 const CACHE_TTL_MS = 30_000;
@@ -46,7 +47,7 @@ export class TodoistSignalSource implements SignalSource {
    if (!res.ok) {
      if (res.status === 401) {
-        console.error(`[todoist] token expired for user ${userId}`);
+        logger.warn({ userId }, 'todoist: token expired');
        bus.publish('signals.integration.token_expired', {
          userId,
          provider: 'todoist',
@@ -88,7 +89,7 @@ export class TodoistSignalSource implements SignalSource {
    });
    this.cache.set(userId, { signals, fetchedAt: Date.now() });
-    bus.publish('signals.task.synced', { userId, count: signals.length, syncedAt: now });
+    bus.publish('signals.task.synced', { userId, source: 'todoist', count: signals.length, syncedAt: now });
    return signals;
  }
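The per-user cache visible in this file (`CACHE_TTL_MS = 30_000`, entries stamped with `fetchedAt`) follows a common time-to-live pattern. A minimal sketch, assuming a generic `TtlCache` class and an injectable clock that do not appear in the diff; whether the real code expires entries at exactly the TTL or just after it is also an assumption.

```typescript
// Illustrative sketch: TtlCache and the injectable clock are assumptions.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; fetchedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined when missing or older than the TTL.
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (this.now() - hit.fetchedAt >= this.ttlMs) {
      this.entries.delete(key); // drop the stale entry so the map does not grow unbounded
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, fetchedAt: this.now() });
  }
}
```

Injecting the clock makes expiry testable without fake timers.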
--- a/services/integrations/README.md
+++ b/services/integrations/README.md
@@ -2,30 +2,49 @@
 Third-party connectors and the token vault.
-## Connector interface
+## Signal source interface
+Each connector implements `SignalSource` from `@oo/shared-types`:
 ```ts
-interface Connector {
-  id: string                                // e.g. "todoist"
-  scopes: string[]                          // human-readable list shown in consent UI
-  beginOAuth(user): Promise<{ redirectUrl, state }>
-  finishOAuth(code, state): Promise<StoredCredential>
-  fetchSignals(user, since?): AsyncIterable<NormalizedEvent>
-  // incremental-sync cursor (Todoist sync_token, webhook timestamps, etc.)
-  // stored in Credential.meta; the connector owns its shape.
-  act?(user, action): Promise<void>          // optional write-back (complete task, etc.)
-  revoke(user): Promise<void>                // REQUIRED: provider-side token revocation on disconnect
+interface SignalSource {
+  readonly id: string                                       // e.g. "todoist"
+  fetchSignals(userId: string): Promise<Signal[]>          // returns normalized Signal[]
+  act?(userId: string, signalId: string, action: string): Promise<void>  // optional write-back
 }
 ```
+`SignalAggregator` (`services/api/src/signals/aggregator.ts`) fans out to all registered sources in parallel, isolating per-source failures.
 ## Token vault
-- Credentials encrypted at rest (libsodium sealed box); key from env/KMS.
-- Refresh handled transparently; consumers never see raw tokens.
-- One row per `(user, provider)` with provider-specific `meta`.
+OAuth tokens stored in the `integration_tokens` SQLite table (`services/api/src/db/schema.ts`):
+| Column | Description |
+|--------|-------------|
+| `userId` | owner |
+| `provider` | e.g. `todoist` |
+| `accessToken` | OAuth access token (plain in dev; encrypted in prod via server secret store) |
+| `tokenStatus` | `active` \| `needs_reconnect` |
+On a 401 from the upstream API, the connector marks the token `needs_reconnect` and publishes `signals.integration.token_expired` so the client can prompt re-auth.
+
-## Roadmap
-- Phase 0: **Todoist** (OAuth2, read tasks, complete task).
-- Phase 2: Google Calendar, Apple Health (web import), generic webhook ingress.
-- Phase 5: public SDK so third parties can ship connectors.
+## Routes
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET` | `/api/integrations` | List connected integrations for current user |
+| `GET` | `/api/integrations/todoist/connect` | Start Todoist OAuth flow |
+| `GET` | `/api/integrations/todoist/callback` | OAuth callback — exchange code, store token |
+| `DELETE` | `/api/integrations/:provider` | Disconnect + delete token |
+## Connectors
+| Connector | Status | Signals produced |
+|-----------|--------|-----------------|
+| Todoist | Phase 1 — active | `task` signals (today + overdue); `done` write-back |
+| Google Calendar | Phase 2 — planned | `event` signals |
+## Extraction criteria
+Extract to its own process when credential blast-radius isolation requires it (e.g. token vault with KMS-backed encryption needs to run in a hardened sidecar) or when connector volume justifies separate scaling.
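The 401 handling described in the README's token vault section (mark the token `needs_reconnect`, then publish `signals.integration.token_expired`) might look roughly like this. It is a sketch under stated assumptions: `TokenStore`, `EventBus`, and `handleUpstreamStatus` are hypothetical names, and the real event payload may carry more fields than the two shown.

```typescript
// Illustrative sketch: TokenStore, EventBus and handleUpstreamStatus are assumptions.
type TokenStatus = 'active' | 'needs_reconnect';

interface TokenStore {
  setStatus(userId: string, provider: string, status: TokenStatus): void;
}
interface EventBus {
  publish(subject: string, payload: Record<string, unknown>): void;
}

type UpstreamOutcome = 'ok' | 'needs_reconnect' | 'error';

function handleUpstreamStatus(
  httpStatus: number,
  userId: string,
  provider: string,
  store: TokenStore,
  bus: EventBus,
): UpstreamOutcome {
  if (httpStatus >= 200 && httpStatus < 300) return 'ok';
  if (httpStatus === 401) {
    // Persist first so a client that reacts to the event reads the new status.
    store.setStatus(userId, provider, 'needs_reconnect');
    bus.publish('signals.integration.token_expired', { userId, provider });
    return 'needs_reconnect';
  }
  return 'error'; // other failures leave the token untouched
}
```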
--- a/services/recommender/README.md
+++ b/services/recommender/README.md
@@ -1,29 +1,42 @@
 # recommender
-The core of oO. Takes a user + a context, returns **one** tip.
+The core of oO. Takes a user + context, returns **one** tip.
 ## Contract
 ```
-POST /recommend
-  { user_id, context?: { time, timezone, client, ... } }
-  → { tip: { id, kind: "todo"|"advice", title, body, source, deep_link, meta } }
+POST /api/recommend
+  { }  (user inferred from session)
+  → { tip: { id, content, source, kind, sourceId?, rationale?, createdAt } }
-POST /feedback
-  { user_id, tip_id, reaction: "done"|"snooze"|"dismiss", at }
+POST /api/tip/:id/feedback
+  { action: "done"|"dismiss"|"snooze"|"helpful"|"not_helpful", dwellMs? }
  → { ok: true }
 ```
-## Internals (stable seams)
+## Pipeline
-- **Candidate sources** — pluggable async generators. v0: Todoist tasks via `integrations`. Later: advice library, calendar nudges, health prompts.
-- **Feature assembler** — fills the `context` blob (inline in Phase 0; calls feature store from M1). Never inlined into policy code.
-- **Policy registry** — `Policy.pick(candidates, context) → tip`. Named entries:
-  - `random` — v0 (Phase 0).
-  - `bandit.linucb.pooled` — v1 (Phase 1). **Global-then-personalize**: pooled features shared across users; per-user residual once data allows.
-  - `remote` — delegates to `ml/serving` FastAPI scorer (Phase 1+).
-- **Shadow hook** — every request optionally runs N shadow policies in parallel and logs their picks + estimated rewards. Promotion from shadow → A/B → launch is a separate, deliberate step (ADR-0002).
-- **TipInstance persistence** — every decision writes `context_snapshot` (features seen at decision time). This is what makes offline replay honest.
+1. **Signals** — `SignalAggregator.fetchAll(userId)` fans out to all registered `SignalSource` implementations in parallel. Currently: `TodoistSignalSource`. Add a source via `aggregator.register(new MySource())`.
+2. **LLM candidates** — `POST /generate` on `ml/serving` returns `TipCandidate[]` from the `tip-generator` LiteLLM alias.
+3. **Scoring** — all candidates sent to `ml/serving` active policy (`POST /score/egreedy`). Falls back to random if `ml/serving` is unreachable.
+4. **Shadow policies** — active policy runs shadow policies in the same request for offline comparison (ADR-0002). Currently: `egreedy-v2` shadows `egreedy-v1`.
+5. **Persistence** — `tipViews` + `tipScores` rows written on every serve; `tipFeedback` row on reaction.
+6. **Reward delivery** — reaction triggers `POST /reward/egreedy` on `ml/serving` with inferred reward value.
-## Phase 0 goal
+## Signal normalization
-`RandomPolicy` only. The service, contract, registry, shadow hook, and tip-instance persistence all exist; no ML yet.
+Signals carry `features: Record<string, number | boolean>` (bandit-ready) and `metadata: Record<string, unknown>` (source-specific raw fields). The bandit treats features as an opaque dict — sources own their feature names. See ADR-0009.
+## Policy registry
+| Policy | Status | Notes |
+|--------|--------|-------|
+| `random` | Fallback | Used when ml/serving is unreachable |
+| `egreedy-v1` | Shadow | d=7, ADR-0007 |
+| `egreedy-v2` | **Active** | d=12 + profile features, ADR-0012 |
+Shadow → active promotion requires offline sim + online agreement (ADR-0002).
+## Extraction criteria
+Extract to its own process at scaling hotspot: when `POST /recommend` p99 latency exceeds SLA or when recommendation CPU displaces API serving on shared host.
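The scoring step with its random fallback (pipeline step 3 and the `random` row of the policy table) can be sketched as below. This is a hedged illustration, not the recommender's code: `pickTip`, `Scorer`, and the assumption that the remote scorer returns one number per candidate are all hypothetical, since the response shape of `/score/egreedy` is not shown in this diff.

```typescript
// Illustrative sketch: pickTip, Scorer and the score-per-candidate shape are assumptions.
interface Candidate {
  id: string;
}
type Scorer = (candidates: Candidate[]) => Promise<number[]>;

interface Pick {
  tip: Candidate;
  policy: 'remote' | 'random';
}

async function pickTip(
  candidates: Candidate[],
  score: Scorer,
  rng: () => number = Math.random,
): Promise<Pick> {
  if (candidates.length === 0) throw new Error('no candidates');
  try {
    const scores = await score(candidates);
    if (scores.length !== candidates.length) throw new Error('score count mismatch');
    // Take the highest-scoring candidate; ties keep the earliest one.
    let best = 0;
    for (let i = 1; i < scores.length; i++) {
      if (scores[i]! > scores[best]!) best = i;
    }
    return { tip: candidates[best]!, policy: 'remote' };
  } catch {
    // Scorer unreachable or malformed response: fall back to a uniform random pick.
    const i = Math.floor(rng() * candidates.length);
    return { tip: candidates[i]!, policy: 'random' };
  }
}
```

Recording which policy produced the pick keeps fallback serves distinguishable from scored serves in later analysis.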
Author	SHA1	Message	Date
alvis	e40dfdcbb0	chore(infra): wire MLflow/Airflow env vars, fix healthcheck, add .dockerignore - docker-compose: pass ML_SERVING_URL, MLFLOW_URL, AIRFLOW_URL + creds to api service - docker-compose: pass NEXT_PUBLIC_MLFLOW_URL/AIRFLOW_URL to admin service - docker-compose: replace wget healthcheck with node fetch (wget not in node image) - docker-compose: enable Airflow basic_auth API backend; add MLflow pip dep for DAGs - Dockerfiles: tighten layer caching, add .dockerignore Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 12:08:43 +00:00
alvis	bad1bb2cba	feat(simulate): MLflow tracking, Airflow DAG integration, health checks for mlflow/airflow - sim_runs schema: add judge_mode, n_policies, airflow_dag_run_id, mlflow_run_id columns - admin health endpoint: add mlflow + airflow checks (Basic auth for Airflow API) - admin nav: add Simulations page link; rename section label - runner.py: optional MLflow experiment tracking; multi-policy support - sim_dag.py: Airflow DAG for offline sim pipeline - admin simulate page + API client methods for sim runs - shared-types tsconfig: exclude test files from build Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 12:08:36 +00:00
alvis	e96ceb7ee1	feat(auth): token-based admin authentication for Playwright/CI (#105 ) Add POST /api/auth/token — validates ADMIN_TOKEN env var, creates a 24h session and sets the sid cookie so automated tools can access the admin panel without Google OAuth. Admin login page gains a token input form. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 12:07:43 +00:00
alvis	b554970032	docs(observability): add services/api README; update ml/serving + recommender docs (#18 ) - services/api/README.md: new — contract, middleware stack, background tasks, config table (LOG_LEVEL, SENTRY_DSN), health story, extraction criteria - ml/serving/README.md: add Observability section (structlog JSON, traceparent → trace_id binding), add SENTRY_DSN + ENV to config table - services/recommender/README.md: fix policy table — egreedy-v2 is active (#99), egreedy-v1 is shadow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 03:41:39 +00:00
alvis	c4960d0601	feat(observability): structured logs, W3C trace IDs, Sentry hooks (#18 ) - TS: pino + pino-http; every HTTP request log includes traceId from W3C traceparent header (generated if absent); forwarded to ml/serving on all /score, /generate, /reward, and /api/ml proxy calls - Python: structlog JSON; FastAPI middleware binds trace_id via contextvars so every log line within a request carries it - Sentry: optional SENTRY_DSN init in both runtimes (no-op if unset) - Replace all console.* calls across services/api with pino logger - Update tests to spy on logger instead of console Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 03:37:28 +00:00
alvis	7281af83a4	feat(bandit): promote egreedy-v2 (D=12, profile features) as active policy (#99 ) Offline sim gate passed — egreedy-v2 mean reward −0.629 vs egreedy-v1 −0.642 (5 users × 20 rounds, rule judge, seed 42). v2 wins 3/5 personas. - recommender.ts: switch remotePolicy() to /score/egreedy/v2 - recommender.ts: switch sendRewardWithRetry() to /reward/egreedy/v2 with profile_features payload so the ridge update uses the full D=12 vector - recommender.ts: re-fetch profile at feedback time (TTL-cached, near-instant) - ADR-0012: status Accepted → Promoted, promotion record appended Shadow entry egreedy-v2-shadow kept in registry (active: false) for rollback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 03:08:28 +00:00
alvis	cba3f1a184	docs(services): update integrations + recommender READMEs for signal abstraction (#78 ) integrations/README — replace stale Connector interface and fictional libsodium vault with the actual SignalSource pattern, SQLite token table, and real OAuth routes. recommender/README — document the SignalAggregator pipeline, current policy registry, and actual /recommend + /feedback contract shapes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 17:17:38 +00:00
alvis	352469162d	fix(signals): add missing source field to TaskSyncedEvent (#78 ) TaskSyncedPayload in shared-types and ml/serving schemas both require source, but TaskSyncedEvent in bus.ts and the todoist publish call both omitted it — causing the JetStream consumer to nak every task.synced message on validation failure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 17:15:32 +00:00
alvis	45416000f9	feat(features): per-feature freshness spec — JIT vs batched (#61 ) Each ml/features/*.py now declares freshness, source, and fallback per feature. ProfileFeature gains ttl_sec (mirrored from registry.ts), freshness="batched", source, and fallback. context.py adds ContextFeatureSpec + CONTEXT_FEATURES for the three JIT features (hour_of_day, day_of_week, tasks). CI test parses ttlSec from registry.ts to catch drift. ml/README updated with split JIT/batched feature contract. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 17:02:55 +00:00
alvis	bd3ea1b8b1	docs(schema): update docs for #54 — proto registry + buf CI gate - packages/shared-types/README.md: new — documents HTTP vs event surfaces, proto file layout, schema evolution rules, and how to run buf locally - ml/serving/README.md: note pydantic payload validation in consumer section - CLAUDE.md: replace "schema registry enforced when #54 lands" with the actual state; remove #54 from active-work list Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:53:20 +00:00
alvis	377373a95d	test(schema): unit tests for schemas.py and nats_consumer._handle (#54 ) 17 tests covering: pydantic model validation (all payload types, optional fields, invalid enum values, missing required fields), _handle write path for task_synced, validation errors surfaced through _make_handler causing nak instead of ack. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:51:15 +00:00
alvis	d539fde0c1	feat(schema): protobuf event registry + buf CI gate (#54 ) - Add proto schemas in packages/shared-types/events/ (oo.events.v1): envelope.proto, signals.proto, integration.proto - buf.yaml with STANDARD lint + FILE breaking-change rules - .gitea/workflows/buf-check.yaml: lint + breaking check on every PR touching events/ (needs a Gitea Actions runner to execute) - scripts/buf-check.sh: local equivalent of the CI check - NormalizedEvent TS envelope gains eventId, schemaVersion, producer to align with the proto Envelope message - ml/serving/schemas.py: pydantic models mirroring the v1 proto types - nats_consumer.py: validate payloads via pydantic instead of raw .get() A field-rename PR will now fail buf breaking with exit code 100 and show the offending messages. To make a breaking change: keep the old field reserved, add the new one, bump schema_version to v2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 16:48:24 +00:00
alvis	f48b5a7646	docs(ml): serving README + update ml/README and CLAUDE.md for #98 - ml/serving/README.md: new — contract, JetStream consumer docs, config, health story, extraction criteria, state file reference - ml/README.md: note JetStream consumers in serving/ row - CLAUDE.md: update active work to reflect #98 shipped, #99 still pending Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 10:21:40 +00:00
alvis	4652e4b582	feat(ml): JetStream durable consumers in ml/serving (#98 ) Adds a NATS JetStream consumer to ml/serving so the feature pipeline can react to events without the API triggering every read. - nats_consumer.py: durable push consumers for signals.> and feedback.> streams; acks on success, naks for redeliver, up to NATS_MAX_DELIVER attempts; per-consumer health state (last_msg_ts, processed, errors) - main.py: FastAPI lifespan wires start/stop; /health exposes nats state - requirements.txt: adds nats-py>=2.9.0 - Dockerfile.ml: copy all *.py from ml/serving (was missing prompts.py) Handled subjects: signals.task.synced → writes per-user sync metadata to STATE_DIR signals.tip.feedback → logged for observability (reward via HTTP path) Config: NATS_URL (empty = disabled), NATS_DURABLE_PREFIX, NATS_MAX_DELIVER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 10:19:47 +00:00