Compare commits
14 Commits
2d7cf217a9...e40dfdcbb0
| SHA1 |
|---|
| e40dfdcbb0 |
| bad1bb2cba |
| e96ceb7ee1 |
| b554970032 |
| c4960d0601 |
| 7281af83a4 |
| cba3f1a184 |
| 352469162d |
| 45416000f9 |
| bd3ea1b8b1 |
| 377373a95d |
| d539fde0c1 |
| f48b5a7646 |
| 4652e4b582 |
19
.dockerignore
Normal file
@@ -0,0 +1,19 @@
**/node_modules
**/.next
**/dist
**/coverage
**/.vitest-cache
**/.turbo
.git
.gitea
.github
.vscode
.idea
**/.env
**/.env.local
**/*.log
docs
infra/docker/data
**/__tests__
**/*.test.ts
**/*.test.tsx
26
.env.example
@@ -10,6 +10,32 @@ API_BASE_URL=http://localhost:3078
WEB_BASE_URL=http://localhost:3000
ML_SERVING_URL=http://localhost:8000

# MLflow (mlops profile) — http://localhost:5000/mlflow in dev, https://o.alogins.net/mlflow in prod.
# MLFLOW_ADMIN_PASSWORD seeds the admin account on first boot (changing it after first run
# requires the MLflow UI or API — see infra/mlflow/basic_auth.ini).
MLFLOW_URL=http://localhost:5000
MLFLOW_ADMIN_PASSWORD=change-me
# Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
NEXT_PUBLIC_MLFLOW_URL=http://localhost:5000

# Airflow (mlops profile) — http://localhost:8080/airflow in dev.
# Start with: docker compose --profile full --profile mlops up
AIRFLOW_URL=http://localhost:8080
AIRFLOW_ADMIN_PASSWORD=change-me
AIRFLOW_DB_PASSWORD=airflow
AIRFLOW_SECRET_KEY=change-me-in-prod
AIRFLOW_FERNET_KEY=
AIRFLOW_BASE_URL=https://o.alogins.net/airflow
# Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
NEXT_PUBLIC_AIRFLOW_URL=http://localhost:8080

# Shared secret for Airflow→API internal callbacks. Generate: openssl rand -hex 32
INTERNAL_API_TOKEN=

# Static token for automated/service access to the admin panel (e.g. Playwright tests).
# Leave empty to disable token-based login. Generate: openssl rand -hex 32
ADMIN_TOKEN=

# AI stack — shared Agap services (ollama + litellm + langfuse). Not run from oO.
# Prod: https://llm.alogins.net | Dev: http://host.docker.internal:4000 from containers,
# http://localhost:4000 from host. Ollama: http://host.docker.internal:11434 / :11434.
37
.gitea/workflows/buf-check.yaml
Normal file
@@ -0,0 +1,37 @@
name: buf-check

on:
  push:
    branches: [main]
    paths:
      - 'packages/shared-types/events/**'
  pull_request:
    paths:
      - 'packages/shared-types/events/**'

jobs:
  buf:
    name: Lint & breaking-change check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install buf
        run: |
          BUF_VERSION=1.50.0
          curl -sSfL \
            "https://github.com/bufbuild/buf/releases/download/v${BUF_VERSION}/buf-Linux-x86_64" \
            -o /usr/local/bin/buf
          chmod +x /usr/local/bin/buf
          buf --version

      - name: buf lint
        run: buf lint packages/shared-types/events

      - name: buf breaking
        if: github.event_name == 'pull_request'
        run: |
          buf breaking packages/shared-types/events \
            --against ".git#branch=${{ github.base_ref }},subdir=packages/shared-types/events"
14
CLAUDE.md
@@ -56,7 +56,7 @@ docs/ architecture notes, ADRs, API specs
## Contracts between modules

- **HTTP** (OpenAPI, in `packages/shared-types/http/`) — synchronous request/response. In-process today; over the network once extracted. Signatures are identical.
- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Schema registry enforced in CI when #54 lands; until then payloads are JSON envelopes (ADR-0005).
- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Proto schemas (ADR-0005) live in `packages/shared-types/events/oo/events/v1/`; `buf lint` + `buf breaking` run in CI on every PR touching those files (`.gitea/workflows/buf-check.yaml`).
- Do not redefine types per module. Regenerate from `shared-types`.

## Conventions
@@ -100,7 +100,7 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i

**M1 shipped. M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`.

Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
Active work: bandit promotion (#99 — offline sim + ADR-0012 pending) and M2 issues (#61 freshness SLAs, #78 signal abstraction, #93 model benchmark).

## What NOT to do

@@ -112,3 +112,13 @@ Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
- Don't call LLMs directly from application code. All LLM calls go through `ml/serving` (Python) via `LITELLM_URL`. The TS recommender never holds a model name.
- Don't embed MLflow/Airflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `/airflow`, `ai.alogins.net`.
- Don't `nats.publish()` directly from feature code. All publishes go through the in-process `Bus` (`services/api/src/events/bus.ts`); the NATS adapter (`events/nats.ts`) bridges every publish to JetStream when `NATS_URL` is set. This keeps subscribers, the ring-buffer tail used by the admin event viewer, and JetStream all in lockstep.
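
The last rule above can be pictured with a minimal sketch. It assumes a hypothetical `Bus` shape: the real one lives in `services/api/src/events/bus.ts` and may look quite different. The class, method names, event subjects, and the ring size are all illustrative assumptions. The point it shows is that a single `publish()` call feeds the ring buffer, the in-process subscribers, and the optional bridge, so the three cannot drift apart.

```typescript
// Hypothetical sketch; NOT the actual services/api/src/events/bus.ts.
type BusEvent = { subject: string; payload: unknown };
type Handler = (event: BusEvent) => void;

class Bus {
  private handlers = new Map<string, Handler[]>();
  private ring: BusEvent[] = [];
  private readonly ringSize: number;
  // Optional bridge, e.g. a NATS JetStream adapter installed when NATS_URL is set.
  onPublish?: Handler;

  constructor(ringSize = 100) {
    this.ringSize = ringSize;
  }

  subscribe(subject: string, handler: Handler): void {
    const list = this.handlers.get(subject) ?? [];
    list.push(handler);
    this.handlers.set(subject, list);
  }

  publish(subject: string, payload: unknown): void {
    const event: BusEvent = { subject, payload };
    // 1. Keep a bounded tail (what an event viewer would read).
    this.ring.push(event);
    if (this.ring.length > this.ringSize) this.ring.shift();
    // 2. Deliver to in-process subscribers of this subject.
    for (const handler of this.handlers.get(subject) ?? []) handler(event);
    // 3. Mirror to the durable transport, if a bridge is installed.
    this.onPublish?.(event);
  }

  // Returns a copy of the buffered tail, oldest first.
  tail(): BusEvent[] {
    return [...this.ring];
  }
}
```

Feature code in this sketch only ever calls `bus.publish(...)`; whether a bridge is attached is decided once at startup.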

## Admin app

`apps/admin` rewrites `/api/*` → `$NEXT_PUBLIC_API_URL/api/*` via `next.config.ts`. So `apiFetch('/admin/stats')` in `apps/admin/src/lib/api.ts` hits the Express backend, not a Next.js route.
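
As a rough illustration of the rewrite described above, the sketch below builds the rule with the `source`/`destination` shape that Next.js `rewrites()` expects. The helper name `buildApiRewrites`, the trailing-slash handling, and the hard-coded URL are assumptions for the example; the actual `next.config.ts` is not shown in this diff and presumably reads the base URL from `NEXT_PUBLIC_API_URL`.

```typescript
// Hypothetical helper; the real apps/admin/next.config.ts may be structured differently.
type Rewrite = { source: string; destination: string };

// Builds the rule that forwards /api/* to the backend at apiUrl.
function buildApiRewrites(apiUrl: string): Rewrite[] {
  const base = apiUrl.replace(/\/+$/, ''); // drop trailing slashes
  return [{ source: '/api/:path*', destination: `${base}/api/:path*` }];
}

// Shape of the object a next.config.ts would export.
// The URL is a placeholder taken from the dev default in .env.example.
const nextConfig = {
  async rewrites(): Promise<Rewrite[]> {
    return buildApiRewrites('http://localhost:3078');
  },
};
```

With a rule like this, a browser request to `/api/admin/stats` is proxied to the backend under the same path, which is why no Next.js route handler is involved.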

Running `tsc --noEmit -p apps/admin/tsconfig.json` always reports `Cannot find module 'next'` errors — expected outside the Next.js build context; use `next build` for real type errors.

## Auth / session pattern

Sessions use an `sid` cookie. Admin routes stack `requireAuth` (sets `req.userId`) then `requireAdmin` (checks `role = 'admin'` in DB). Token-based admin auth: `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var sets the `sid` cookie — used by Playwright and CI.
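
The ordering of the two guards can be sketched without any framework. Everything below is an assumption made for illustration: the `Req` shape, the in-memory `sessions` and `roles` maps standing in for the session store and the DB, and the 401/403 status codes. Only the names `requireAuth`, `requireAdmin`, `sid`, and `req.userId` come from the text above.

```typescript
// Hypothetical, framework-free sketch of the requireAuth -> requireAdmin stack.
type Req = { cookies: Record<string, string>; userId?: string };
type Result = { status: number };
type Middleware = (req: Req) => Result | null; // null means "continue"

// Stand-ins for the session store and the users table (assumed data).
const sessions = new Map<string, string>([['sid-1', 'u1'], ['sid-2', 'u2']]);
const roles = new Map<string, string>([['u1', 'admin'], ['u2', 'user']]);

// Resolves the sid cookie to a user and records it on the request.
const requireAuth: Middleware = (req) => {
  const userId = sessions.get(req.cookies['sid'] ?? '');
  if (!userId) return { status: 401 };
  req.userId = userId; // later middleware relies on this
  return null;
};

// Runs after requireAuth; rejects users whose role is not 'admin'.
const requireAdmin: Middleware = (req) =>
  req.userId !== undefined && roles.get(req.userId) === 'admin' ? null : { status: 403 };

// Runs middleware in order and stops at the first rejection.
function runStack(req: Req, stack: Middleware[]): Result {
  for (const mw of stack) {
    const rejected = mw(req);
    if (rejected) return rejected;
  }
  return { status: 200 };
}
```

Because `requireAdmin` reads `req.userId`, it only makes sense after `requireAuth` in the stack; swapping the order would reject every request in this sketch.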
@@ -8,6 +8,15 @@ Next.js 15 app. Deployed at `admin.o.alogins.net` (dev: `http://localhost:3080`)
and checks `role === 'admin'`. First admin is seeded via `ADMIN_SEED_EMAIL` env var at API startup.
- Admin write actions are appended to the `admin_actions` audit log in the DB.

## Authentication

Two ways to sign in:

| Method | How |
|--------|-----|
| Google OAuth | Click "Sign in with Google" on the login page |
| Token | `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var; sets `sid` cookie valid for 24 h. Used by Playwright tests and CI automation. |
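
A minimal sketch of what the server-side check for the token method might look like. The function names and the comparison strategy are assumptions, not the API's actual code; the two facts taken from this change set are the 24 h cookie lifetime and that an empty `ADMIN_TOKEN` disables token login.

```typescript
// Hypothetical sketch of the check behind POST /api/auth/token.
const SESSION_TTL_MS = 24 * 60 * 60 * 1000; // cookie is valid for 24 h

// Compares without returning early on the first differing character.
// An empty expected token means token login is disabled.
function tokensMatch(provided: string, expected: string): boolean {
  if (expected.length === 0) return false;
  let diff = provided.length ^ expected.length;
  for (let i = 0; i < expected.length; i++) {
    // charCodeAt past the end yields NaN, which `|| 0` turns into 0.
    diff |= (provided.charCodeAt(i) || 0) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}

// Expiry timestamp for a session created at `now`.
function sessionExpiry(now: Date): Date {
  return new Date(now.getTime() + SESSION_TTL_MS);
}
```

In a Node service, `crypto.timingSafeEqual` on equal-length buffers is the more usual choice; the hand-rolled loop is used here only to keep the example free of imports.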

## Pages

| Route | Description |
@@ -1,15 +1,67 @@
'use client';

import { useState } from 'react';
import { useRouter } from 'next/navigation';

export default function LoginPage() {
  const router = useRouter();
  const [token, setToken] = useState('');
  const [error, setError] = useState('');
  const [loading, setLoading] = useState(false);

  async function handleTokenLogin(e: React.FormEvent) {
    e.preventDefault();
    setError('');
    setLoading(true);
    try {
      const res = await fetch('/api/auth/token', {
        method: 'POST',
        credentials: 'include',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ token }),
      });
      if (!res.ok) {
        const data = await res.json().catch(() => ({}));
        setError((data as { error?: string }).error ?? 'Invalid token');
        return;
      }
      router.push('/');
    } catch {
      setError('Request failed');
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="flex min-h-screen items-center justify-center">
      <div className="text-center space-y-4">
      <div className="text-center space-y-6 w-72">
        <h1 className="text-2xl font-semibold">oO Admin</h1>
        <p className="text-gray-400 text-sm">Sign in via the main app first, then return here.</p>

        <a
          href="/sign-in"
          className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors"
        >
          Sign in with Google
        </a>

        <form onSubmit={handleTokenLogin} className="space-y-3">
          <input
            type="password"
            placeholder="Admin token"
            value={token}
            onChange={(e) => setToken(e.target.value)}
            className="w-full px-3 py-2 bg-gray-900 border border-gray-700 rounded text-sm focus:outline-none focus:border-gray-500"
          />
          {error && <p className="text-red-400 text-xs">{error}</p>}
          <button
            type="submit"
            disabled={loading || !token}
            className="w-full px-4 py-2 bg-gray-700 text-white rounded text-sm font-medium hover:bg-gray-600 disabled:opacity-40 transition-colors"
          >
            {loading ? 'Signing in…' : 'Sign in with token'}
          </button>
        </form>
      </div>
    </div>
  );
220
apps/admin/src/app/simulate/page.tsx
Normal file
@@ -0,0 +1,220 @@
'use client';

import { useEffect, useState } from 'react';
import { AdminShell } from '@/components/AdminShell';
import {
  startSimulation,
  getSimulationRuns,
  getSimulationRun,
  SimRun,
} from '@/lib/api';

const POLICIES = ['linucb-v1', 'egreedy-v1', 'egreedy-v2'];
const mlflowBase = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
const airflowBase = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';

function mlflowRunUrl(runId: string) {
  return `${mlflowBase}/#/experiments/1/runs/${runId}`;
}

function airflowRunUrl(dagRunId: string) {
  return `${airflowBase}/dags/bandit_sim/grid?dag_run_id=${encodeURIComponent(dagRunId)}`;
}

function StatusBadge({ status }: { status: string }) {
  const cls: Record<string, string> = {
    running: 'bg-blue-900 text-blue-300 border-blue-800',
    done: 'bg-green-900 text-green-300 border-green-800',
    failed: 'bg-red-900 text-red-300 border-red-800',
    pending: 'bg-gray-800 text-gray-400 border-gray-700',
  };
  return (
    <span className={`text-xs px-2 py-0.5 rounded border ${cls[status] ?? cls.pending}`}>
      {status}
    </span>
  );
}

function SummaryRow({ run }: { run: SimRun }) {
  const summary = run.summaryJson ? JSON.parse(run.summaryJson) as Record<string, { total_reward: number; mean_reward: number; n_pulls: number }> : null;
  return (
    <div className="bg-gray-900 border border-gray-800 rounded p-4 space-y-2">
      <div className="flex items-center justify-between">
        <div className="space-y-0.5">
          <div className="flex items-center gap-2">
            <span className="font-mono text-xs text-gray-500">{run.id}</span>
            <StatusBadge status={run.status} />
            {run.winner && <span className="text-xs text-indigo-400">winner: {run.winner}</span>}
          </div>
          <div className="text-xs text-gray-600">
            {run.nUsers}u × {run.nRounds}r × {run.tasksPerRound}t/r — {run.judgeMode} judge
            {' · '}{new Date(run.createdAt).toLocaleString()}
          </div>
        </div>
        <div className="flex items-center gap-2 flex-shrink-0">
          {run.mlflowRunId && (
            <a href={mlflowRunUrl(run.mlflowRunId)} target="_blank" rel="noreferrer"
              className="text-xs text-indigo-400 hover:underline">MLflow ↗</a>
          )}
          {run.airflowDagRunId && (
            <a href={airflowRunUrl(run.airflowDagRunId)} target="_blank" rel="noreferrer"
              className="text-xs text-indigo-400 hover:underline">Airflow ↗</a>
          )}
        </div>
      </div>
      {summary && (
        <div className="grid grid-cols-2 gap-2 pt-1 lg:grid-cols-3">
          {Object.entries(summary).map(([policy, s]) => (
            <div key={policy} className={`rounded border p-2 text-xs ${policy === run.winner ? 'border-indigo-700 bg-indigo-950' : 'border-gray-800'}`}>
              <div className="font-mono font-medium text-gray-300 mb-1">{policy}</div>
              <div className="text-gray-500 space-y-0.5">
                <div>total <span className="text-gray-300">{s.total_reward.toFixed(2)}</span></div>
                <div>mean <span className="text-gray-300">{s.mean_reward.toFixed(4)}</span></div>
                <div>pulls <span className="text-gray-300">{s.n_pulls}</span></div>
              </div>
            </div>
          ))}
        </div>
      )}
    </div>
  );
}

export default function SimulatePage() {
  const [runs, setRuns] = useState<SimRun[]>([]);
  const [loading, setLoading] = useState(true);
  const [launching, setLaunching] = useState(false);
  const [error, setError] = useState('');
  const [msg, setMsg] = useState('');

  const [nUsers, setNUsers] = useState(5);
  const [nRounds, setNRounds] = useState(20);
  const [tasksPerRound, setTasksPerRound] = useState(8);
  const [judgeMode, setJudgeMode] = useState<'rule' | 'llm'>('rule');
  const [selectedPolicies, setSelectedPolicies] = useState<string[]>(['linucb-v1', 'egreedy-v1']);

  const refresh = () =>
    getSimulationRuns()
      .then((r) => setRuns(r.runs))
      .catch((e) => setError(e.message))
      .finally(() => setLoading(false));

  useEffect(() => {
    refresh();
    const t = setInterval(refresh, 8_000);
    return () => clearInterval(t);
  }, []);

  const togglePolicy = (p: string) =>
    setSelectedPolicies((prev) =>
      prev.includes(p) ? prev.filter((x) => x !== p) : [...prev, p],
    );

  const handleLaunch = async () => {
    if (selectedPolicies.length < 2) { setError('Select at least 2 policies.'); return; }
    setLaunching(true); setError(''); setMsg('');
    try {
      const r = await startSimulation({ nUsers, nRounds, tasksPerRound, judgeMode, policies: selectedPolicies });
      setMsg(r.airflow_dag_run_id
        ? `Launched via Airflow — dag_run_id: ${r.airflow_dag_run_id}`
        : `Launched locally — run id: ${r.id}`);
      await refresh();
    } catch (e: unknown) {
      setError((e as Error).message);
    } finally {
      setLaunching(false);
    }
  };

  return (
    <AdminShell>
      <div className="space-y-8 max-w-4xl">
        <h1 className="text-xl font-semibold">Simulations</h1>
        {error && <p className="text-red-400 text-sm">{error}</p>}
        {msg && <p className="text-green-400 text-sm">{msg}</p>}

        {/* Launch form */}
        <section className="bg-gray-900 border border-gray-800 rounded p-5 space-y-4">
          <h2 className="text-base font-medium text-gray-300">New simulation</h2>

          <div className="grid grid-cols-3 gap-4 text-sm">
            <label className="space-y-1">
              <span className="text-gray-500">Users</span>
              <input type="number" min={1} max={50} value={nUsers}
                onChange={(e) => setNUsers(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
            <label className="space-y-1">
              <span className="text-gray-500">Rounds</span>
              <input type="number" min={1} max={200} value={nRounds}
                onChange={(e) => setNRounds(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
            <label className="space-y-1">
              <span className="text-gray-500">Tasks/round</span>
              <input type="number" min={1} max={20} value={tasksPerRound}
                onChange={(e) => setTasksPerRound(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
          </div>

          <div className="space-y-1 text-sm">
            <span className="text-gray-500">Policies (select ≥ 2)</span>
            <div className="flex gap-2 flex-wrap pt-1">
              {POLICIES.map((p) => (
                <button key={p} onClick={() => togglePolicy(p)}
                  className={`px-3 py-1 rounded border text-xs font-mono ${
                    selectedPolicies.includes(p)
                      ? 'bg-indigo-900 border-indigo-700 text-indigo-200'
                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
                  }`}>
                  {p}
                </button>
              ))}
            </div>
          </div>

          <div className="space-y-1 text-sm">
            <span className="text-gray-500">Judge</span>
            <div className="flex gap-2 pt-1">
              {(['rule', 'llm'] as const).map((m) => (
                <button key={m} onClick={() => setJudgeMode(m)}
                  className={`px-3 py-1 rounded border text-xs ${
                    judgeMode === m
                      ? 'bg-gray-700 border-gray-500 text-white'
                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
                  }`}>
                  {m}
                </button>
              ))}
            </div>
            {judgeMode === 'llm' && (
              <p className="text-xs text-yellow-600 mt-1">LLM judge requires ANTHROPIC_API_KEY in ml/serving env.</p>
            )}
          </div>

          <button onClick={handleLaunch} disabled={launching}
            className="bg-indigo-600 hover:bg-indigo-500 disabled:opacity-50 text-white rounded px-4 py-2 text-sm">
            {launching ? 'Launching…' : 'Launch simulation'}
          </button>
          <p className="text-xs text-gray-600">
            Runs via <a href={airflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">Airflow</a> (mlops profile) when available; falls back to local subprocess.
            Results logged to <a href={mlflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">MLflow</a>.
          </p>
        </section>

        {/* Run history */}
        <section className="space-y-3">
          <h2 className="text-base font-medium text-gray-300">
            Run history
            {loading && <span className="text-xs text-gray-600 ml-2">loading…</span>}
          </h2>
          {runs.length === 0 && !loading && (
            <p className="text-gray-600 text-sm">No simulations yet.</p>
          )}
          {runs.map((r) => <SummaryRow key={r.id} run={r} />)}
        </section>
      </div>
    </AdminShell>
  );
}
@@ -2,6 +2,7 @@

import Link from 'next/link';
import { usePathname } from 'next/navigation';
import { useEffect, useState } from 'react';

const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
@@ -10,6 +11,7 @@ type NavItem = {
  href: string;
  label: string;
  external?: boolean;
  svcName?: string; // key in the health services map
};

type NavSection = {
@@ -31,10 +33,11 @@ const NAV: NavSection[] = [
    ],
  },
  {
    label: 'Recommender status',
    label: 'Recommender',
    items: [
      { href: '/tips', label: 'Tips' },
      { href: '/reward-analytics', label: 'Rewards' },
      { href: '/simulate', label: 'Simulations' },
    ],
  },
  {
@@ -50,14 +53,33 @@ const NAV: NavSection[] = [
    label: 'Resources',
    items: [
      { href: '/docs', label: 'Docs' },
      { href: mlflowUrl, label: 'MLflow ↗', external: true },
      { href: airflowUrl, label: 'Airflow ↗', external: true },
      { href: mlflowUrl, label: 'MLflow ↗', external: true, svcName: 'mlflow' },
      { href: airflowUrl, label: 'Airflow ↗', external: true, svcName: 'airflow' },
    ],
  },
];

const STATUS_DOT: Record<string, string> = {
  ok: 'bg-green-500',
  degraded: 'bg-yellow-400',
  down: 'bg-red-500',
};

export function AdminShell({ children }: { children: React.ReactNode }) {
  const pathname = usePathname();
  const [svcStatus, setSvcStatus] = useState<Record<string, string>>({});

  useEffect(() => {
    fetch('/api/admin/health', { credentials: 'include' })
      .then((r) => r.json())
      .then((data: { services?: { name: string; status: string }[] }) => {
        const map: Record<string, string> = {};
        for (const s of data.services ?? []) map[s.name] = s.status;
        setSvcStatus(map);
      })
      .catch(() => {});
  }, []);

  return (
    <div className="flex min-h-screen">
      {/* Sidebar */}
@@ -83,13 +105,19 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
              const active =
                !item.external &&
                (item.href === '/' ? pathname === '/' : pathname.startsWith(item.href));
              const className = `flex items-center px-3 py-2 rounded text-sm transition-colors ${
              const className = `flex items-center gap-2 px-3 py-2 rounded text-sm transition-colors ${
                active
                  ? 'bg-gray-800 text-white font-medium'
                  : item.external
                    ? 'text-gray-500 hover:text-white hover:bg-gray-900'
                    : 'text-gray-400 hover:text-white hover:bg-gray-900'
              }`;
              const dot = item.svcName
                ? svcStatus[item.svcName]
                  ? <span className={`inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 ${STATUS_DOT[svcStatus[item.svcName]] ?? STATUS_DOT.down}`} />
                  : <span className="inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 bg-gray-700" />
                : null;

              return item.external ? (
                <a
                  key={item.href}
@@ -98,6 +126,7 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
                  rel="noreferrer"
                  className={className}
                >
                  {dot}
                  {item.label}
                </a>
              ) : (
@@ -262,3 +262,49 @@ export function saveQuery(name: string, querySql: string) {
export function deleteSavedQuery(id: string) {
  return apiFetch<{ ok: boolean }>(`/admin/saved-queries/${id}`, { method: 'DELETE' });
}

// ── Simulations ────────────────────────────────────────────────────────────

export interface SimRun {
  id: string;
  policyA: string;
  policyB: string;
  nUsers: number;
  nRounds: number;
  tasksPerRound: number;
  judgeMode: string;
  nPolicies: number;
  status: 'pending' | 'running' | 'done' | 'failed';
  summaryJson: string | null;
  winner: string | null;
  personaBreakdownJson: string | null;
  airflowDagRunId: string | null;
  mlflowRunId: string | null;
  createdAt: string;
  finishedAt: string | null;
}

export interface SimStartRequest {
  nUsers?: number;
  nRounds?: number;
  tasksPerRound?: number;
  judgeMode?: 'rule' | 'llm';
  policies?: string[];
}

export function startSimulation(req: SimStartRequest) {
  return apiFetch<{ id: string; status: string; airflow_dag_run_id?: string }>(
    '/admin/simulate/start',
    { method: 'POST', body: JSON.stringify(req) },
  );
}

export function getSimulationRuns() {
  return apiFetch<{ runs: SimRun[] }>('/admin/simulate/runs');
}

export function getSimulationRun(id: string) {
  return apiFetch<{ run: SimRun & { isRunning: boolean }; events: unknown[] }>(
    `/admin/simulate/${id}`,
  );
}
File diff suppressed because one or more lines are too long
@@ -1,7 +1,7 @@
# ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)

**Status:** Accepted
**Date:** 2026-04-25
**Status:** Promoted
**Date:** 2026-04-25 (accepted) / 2026-04-26 (promoted)
**Issue:** #99

## Context
@@ -106,3 +106,19 @@ projecting theta without the corresponding `A` matrix cannot be done correctly.
the D=12 target in the issue spec and complicates the sim comparison. Deferred.

**In-place v1 promotion without shadow** — violates ADR-0002.

## Promotion record (2026-04-26)

Offline sim (`runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42`):

| policy | total reward | mean reward | pulls |
|--------|-------------|-------------|-------|
| egreedy-v1 | −64.20 | −0.6420 | 100 |
| egreedy-v2 | −62.90 | −0.6290 | 100 |

**Gate passed** (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.
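
The gate arithmetic can be re-derived from the table: mean reward is total reward divided by pulls, and the gate compares the two means. The sketch below is only a check of those numbers; the type and function names are illustrative assumptions and are not taken from `runner.py`.

```typescript
// Figures copied from the promotion table; names are hypothetical.
type PolicySummary = { totalReward: number; pulls: number };

const v1: PolicySummary = { totalReward: -64.2, pulls: 100 };
const v2: PolicySummary = { totalReward: -62.9, pulls: 100 };

// Mean reward per pull.
function meanReward(s: PolicySummary): number {
  return s.totalReward / s.pulls;
}

// Gate rule as stated in the record: candidate mean >= baseline mean.
function gatePassed(candidate: PolicySummary, baseline: PolicySummary): boolean {
  return meanReward(candidate) >= meanReward(baseline);
}

console.log(meanReward(v1).toFixed(4), meanReward(v2).toFixed(4), gatePassed(v2, v1));
```

Both means are negative, so "passing" here means v2 loses slightly less per pull than v1 (a difference of 0.013), on a single seed with 100 pulls per policy.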

Changes applied:
- `recommender.ts` `remotePolicy()`: `/score/egreedy` → `/score/egreedy/v2`
- `recommender.ts` `sendRewardWithRetry()`: `/reward/egreedy` → `/reward/egreedy/v2`, added `profile_features` to payload
- Shadow entry `egreedy-v2-shadow` left in registry (`active: false`) for rollback.
@@ -1,21 +1,22 @@
FROM node:22-alpine AS base
RUN npm install -g pnpm
# syntax=docker/dockerfile:1.7

FROM base AS deps
WORKDIR /app
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
COPY packages/shared-types/package.json ./packages/shared-types/
COPY apps/admin/package.json ./apps/admin/
RUN pnpm install --frozen-lockfile
FROM node:22-slim AS base
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/* \
    && npm install -g pnpm
ENV CI=true \
    PNPM_HOME=/pnpm \
    PATH=/pnpm:$PATH
RUN pnpm config set store-dir /pnpm/store

FROM base AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
COPY --from=deps /app/apps/admin/node_modules ./apps/admin/node_modules
COPY tsconfig.base.json ./
COPY packages/shared-types ./packages/shared-types
COPY apps/admin ./apps/admin
COPY pnpm-lock.yaml ./
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
COPY . .
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
    pnpm install --frozen-lockfile --offline \
    --filter @oo/admin... --filter @oo/shared-types
RUN pnpm --filter @oo/shared-types build
ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow
ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow
@@ -24,7 +25,7 @@ ENV NEXT_TELEMETRY_DISABLED=1 \
    NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL
RUN pnpm --filter @oo/admin build

FROM node:22-alpine AS runner
FROM node:22-slim AS runner
ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080
WORKDIR /app
COPY --from=builder /app/apps/admin/.next/standalone ./
@@ -1,32 +1,35 @@
FROM node:22-alpine AS base
RUN npm install -g pnpm
# syntax=docker/dockerfile:1.7

FROM base AS deps
WORKDIR /app
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
COPY packages/shared-types/package.json ./packages/shared-types/
COPY services/api/package.json ./services/api/
RUN pnpm install --frozen-lockfile
FROM node:22-slim AS base
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 make g++ ca-certificates \
    && rm -rf /var/lib/apt/lists/* \
    && npm install -g pnpm
ENV CI=true \
    PNPM_HOME=/pnpm \
    PATH=/pnpm:$PATH
RUN pnpm config set store-dir /pnpm/store

FROM base AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
COPY --from=deps /app/services/api/node_modules ./services/api/node_modules
COPY tsconfig.base.json ./
COPY packages/shared-types ./packages/shared-types
COPY services/api ./services/api
COPY pnpm-lock.yaml ./
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
COPY . .
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
    pnpm install --frozen-lockfile --offline \
    --filter @oo/api... --filter @oo/shared-types
RUN pnpm --filter @oo/shared-types build
RUN pnpm --filter @oo/api build
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
    pnpm --filter @oo/api --prod deploy --legacy /deploy \
    && cp -r services/api/dist /deploy/dist \
    && rm -rf /deploy/node_modules/@oo/shared-types/src \
    && cp -r packages/shared-types/dist /deploy/node_modules/@oo/shared-types/dist

FROM node:22-alpine AS runner
FROM node:22-slim AS runner
WORKDIR /app
RUN npm install -g pnpm
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
COPY packages/shared-types/package.json ./packages/shared-types/
COPY services/api/package.json ./services/api/
RUN pnpm install --prod --frozen-lockfile
COPY --from=builder /app/packages/shared-types/dist ./packages/shared-types/dist
COPY --from=builder /app/services/api/dist ./services/api/dist
WORKDIR /app/services/api
ENV NODE_ENV=production
COPY --from=builder /deploy/package.json ./
COPY --from=builder /deploy/node_modules ./node_modules
COPY --from=builder /deploy/dist ./dist
CMD ["node", "dist/index.js"]
@@ -2,5 +2,5 @@ FROM python:3.12-slim
WORKDIR /app
COPY ml/serving/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ml/serving/main.py .
COPY ml/serving/*.py .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -11,12 +11,18 @@ services:
    env_file: ../../.env.local
    environment:
      NODE_ENV: production
      ML_SERVING_URL: "http://ml-serving:8000"
      MLFLOW_URL: "http://mlflow:5000"
      AIRFLOW_URL: "http://airflow-webserver:8080"
      AIRFLOW_API_USER: "admin"
      AIRFLOW_API_PASSWORD: "${AIRFLOW_ADMIN_PASSWORD:-admin}"
      INTERNAL_API_TOKEN: "${INTERNAL_API_TOKEN:-}"
    volumes:
      - /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo
    ports:
      - "127.0.0.1:3078:3078"
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3078/health"]
      test: ["CMD", "node", "-e", "fetch('http://localhost:3078/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"]
      interval: 10s
      timeout: 5s
      retries: 5
@@ -49,6 +55,8 @@ services:
      PORT: "3080"
      HOSTNAME: "0.0.0.0"
      NEXT_PUBLIC_API_URL: ""
      NEXT_PUBLIC_MLFLOW_URL: "/mlflow"
      NEXT_PUBLIC_AIRFLOW_URL: "/airflow"
      INTERNAL_API_URL: "http://api:3078"
    ports:
      - "127.0.0.1:3080:3080"
@@ -133,8 +141,14 @@ services:
      AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
      AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
      AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth"
      _PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
      MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
      MLFLOW_TRACKING_USERNAME: "admin"
      MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
    volumes:
      - ../../ml/pipelines:/opt/airflow/dags:ro
      - ../../ml:/opt/airflow/ml:ro
    ports:
      - "127.0.0.1:8080:8080"
    depends_on:
@@ -155,8 +169,13 @@ services:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
      _PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
      MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
      MLFLOW_TRACKING_USERNAME: "admin"
      MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
    volumes:
      - ../../ml/pipelines:/opt/airflow/dags:ro
      - ../../ml:/opt/airflow/ml:ro
    depends_on:
      airflow-init:
        condition: service_completed_successfully
24
ml/README.md
@@ -4,8 +4,8 @@ Python. Owns models, features, training, online scoring.
|
||||
|
||||
| Dir | Role | Phase |
|
||||
|---|---|---|
|
||||
| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`), called by `recommender` | 1–2 |
|
||||
| `features/` | context assembler (`context.py`): signals → `PromptContext`; Feast adapter later | 2 |
|
||||
| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`) + JetStream consumers for `signals.>` / `feedback.>`, called by `recommender` | 1–2 |
|
||||
| `features/` | context assembler (`context.py`): signals → `PromptContext`; profile-feature schema mirror (`profile_schema.py`); Feast adapter later | 2 |
|
||||
| `pipelines/` | batch feature + training DAGs (Prefect/Airflow) | 4 |
|
||||
| `registry/` | MLflow-backed model registry integration | 4 |
|
||||
| `experiments/` | A/B assignment + multi-armed bandit policies | 4 |
|
||||
@@ -18,14 +18,24 @@ Python. Owns models, features, training, online scoring.
|
||||
- Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew).
|
||||
- Shadow deploys before any policy change that affects real users.
|
||||
|
||||
## Profile-feature contract
|
||||
## Feature contract
|
||||
|
||||
### Profile features (batched)
|
||||
|
||||
User-level features (completion rate, preferred hour, tip volume…) are computed
|
||||
by the TypeScript recommender and shipped to ml/serving on every `/score` and
|
||||
by the TypeScript recommender and shipped to `ml/serving` on every `/score` and
|
||||
`/generate` call as `profile_features: dict | None`. The Python mirror in
|
||||
`features/profile_schema.py` documents the available names + dtypes — keep it
|
||||
in sync with `services/api/src/profile/registry.ts` (a CI-style test asserts
|
||||
the name sets match). See ADR-0011.
|
||||
`features/profile_schema.py` documents each feature's name, dtype, TTL, source,
|
||||
and null fallback — keep it in sync with `services/api/src/profile/registry.ts`
|
||||
(a CI-style test asserts names and `ttlSec` values match). See ADR-0011.
|
||||
|
||||
### Context features (JIT)
|
||||
|
||||
Request-time signals assembled by `features/context.py` (`hour_of_day`,
|
||||
`day_of_week`, task list). These are never cached — they are derived from the
|
||||
system clock and the live Todoist feed at the moment of the score call.
|
||||
`CONTEXT_FEATURES` in `context.py` declares freshness, source, and fallback for
|
||||
each field (issue #61).
|
||||
|
||||
## Prompt registry
|
||||
|
||||
|
||||
@@ -26,6 +26,7 @@ from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import sys
|
||||
import time
|
||||
@@ -40,6 +41,12 @@ from llm_judge import ACTIONS, infer_reward, judge
|
||||
from personas import PERSONAS, Persona
|
||||
from task_generator import generate_task_pool
|
||||
|
||||
try:
|
||||
import mlflow
|
||||
_MLFLOW_AVAILABLE = True
|
||||
except ImportError:
|
||||
_MLFLOW_AVAILABLE = False
|
||||
|
||||
POLICY_SCORE_ENDPOINTS: dict[str, str] = {
|
||||
"linucb-v1": "/score",
|
||||
"egreedy-v1": "/score/egreedy",
|
||||
@@ -107,14 +114,30 @@ def _call_reward(
|
||||
|
||||
# ── Standard single-pass runner (rule / llm modes) ─────────────────────────
|
||||
|
||||
def _init_mlflow(mlflow_url: str | None, experiment: str) -> str | None:
|
||||
"""Set up MLflow tracking and return the active run_id, or None if unavailable."""
|
||||
if not _MLFLOW_AVAILABLE or not mlflow_url:
|
||||
return None
|
||||
try:
|
||||
mlflow.set_tracking_uri(mlflow_url)
|
||||
mlflow.set_experiment(experiment)
|
||||
return "ready"
|
||||
except Exception as e:
|
||||
print(f" [warn] MLflow init failed: {e}", file=sys.stderr)
|
||||
return None
|
||||
|
||||
|
||||
def run_simulation(
|
||||
n_users: int, n_rounds: int, tasks_per_round: int,
|
||||
ml_url: str, policies: list[str], use_llm: bool, seed: int,
|
||||
mlflow_url: str | None = None, mlflow_experiment: str = "bandit_simulation",
|
||||
) -> dict:
|
||||
rng = random.Random(seed)
|
||||
run_id = str(uuid.uuid4())[:8]
|
||||
started_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||
|
||||
_init_mlflow(mlflow_url, mlflow_experiment)
|
||||
|
||||
user_personas = [
|
||||
(f"sim-{run_id}-u{i}", PERSONAS[i % len(PERSONAS)])
|
||||
for i in range(n_users)
|
||||
@@ -130,6 +153,26 @@ def run_simulation(
|
||||
}
|
||||
events: list[dict] = []
|
||||
|
||||
mlflow_run_id: str | None = None
|
||||
mlflow_ctx = (
|
||||
mlflow.start_run(run_name=run_id)
|
||||
if (_MLFLOW_AVAILABLE and mlflow_url)
|
||||
else None
|
||||
)
|
||||
|
||||
try:
|
||||
if mlflow_ctx:
|
||||
active = mlflow_ctx.__enter__()
|
||||
mlflow_run_id = active.info.run_id
|
||||
mlflow.log_params({
|
||||
"n_users": n_users,
|
||||
"n_rounds": n_rounds,
|
||||
"tasks_per_round": tasks_per_round,
|
||||
"policies": ",".join(policies),
|
||||
"judge": "llm" if use_llm else "rule",
|
||||
"seed": seed,
|
||||
})
|
||||
|
||||
with httpx.Client(trust_env=False) as client:
|
||||
for rnd in range(n_rounds):
|
||||
hour = rng.randint(6, 22)
|
||||
@@ -139,8 +182,6 @@ def run_simulation(
|
||||
for user_id, persona in user_personas:
|
||||
seed_tasks = rnd * 997 + abs(hash(user_id)) % 997
|
||||
tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks)
|
||||
|
||||
# Per-persona profile features for v2 (synthetic for sim — see ADR-0012)
|
||||
profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None
|
||||
|
||||
for policy in policies:
|
||||
@@ -179,13 +220,34 @@ def run_simulation(
|
||||
prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0
|
||||
acc[p]["cumulative_rewards"].append(prev + round_rewards[p])
|
||||
|
||||
if mlflow_ctx:
|
||||
for p in policies:
|
||||
mlflow.log_metric(f"{p}_cumulative_reward",
|
||||
acc[p]["cumulative_rewards"][-1], step=rnd)
|
||||
|
||||
mode = "llm" if use_llm else "rule"
|
||||
print(f" Round {rnd+1:>3}/{n_rounds} [{mode}] " + " ".join(
|
||||
f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies
|
||||
))
|
||||
|
||||
return _build_result(run_id, started_at, policies, acc, events,
|
||||
result = _build_result(run_id, started_at, policies, acc, events,
|
||||
n_users, n_rounds, tasks_per_round, use_llm, seed)
|
||||
result["mlflow_run_id"] = mlflow_run_id
|
||||
|
||||
if mlflow_ctx:
|
||||
for p, s in result["summary"].items():
|
||||
mlflow.log_metrics({
|
||||
f"{p}_total_reward": s["total_reward"],
|
||||
f"{p}_mean_reward": s["mean_reward"],
|
||||
f"{p}_n_pulls": s["n_pulls"],
|
||||
})
|
||||
mlflow.set_tag("winner", result["winner"])
|
||||
|
||||
return result
|
||||
|
||||
finally:
|
||||
if mlflow_ctx:
|
||||
mlflow_ctx.__exit__(None, None, None)
|
||||
|
||||
|
||||
# ── Claude Code judge — phase 1: score ─────────────────────────────────────
|
||||
@@ -494,6 +556,9 @@ if __name__ == "__main__":
|
||||
help="Alias for --judge rule (backwards compat)")
|
||||
parser.add_argument("--seed", type=int, default=42)
|
||||
parser.add_argument("--out", default=None)
|
||||
parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI"),
|
||||
help="MLflow tracking URI (e.g. http://mlflow:5000/mlflow)")
|
||||
parser.add_argument("--mlflow-experiment", default="bandit_simulation")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.no_llm:
|
||||
@@ -534,6 +599,7 @@ if __name__ == "__main__":
|
||||
n_users=args.n_users, n_rounds=args.n_rounds,
|
||||
tasks_per_round=args.tasks_per_round, ml_url=args.ml_url,
|
||||
policies=args.policies, use_llm=use_llm, seed=args.seed,
|
||||
mlflow_url=args.mlflow_url, mlflow_experiment=args.mlflow_experiment,
|
||||
)
|
||||
Path(out_path).write_text(json.dumps(result, indent=2))
|
||||
print()
|
||||
|
||||
@@ -1,3 +1,8 @@
|
||||
from .context import build_context, PromptContext, TaskSignal
|
||||
from .context import build_context, PromptContext, TaskSignal, ContextFeatureSpec, CONTEXT_FEATURES
|
||||
from .profile_schema import ProfileFeature, PROFILE_FEATURES, feature_names
|
||||
|
||||
__all__ = ["build_context", "PromptContext", "TaskSignal"]
|
||||
__all__ = [
|
||||
"build_context", "PromptContext", "TaskSignal",
|
||||
"ContextFeatureSpec", "CONTEXT_FEATURES",
|
||||
"ProfileFeature", "PROFILE_FEATURES", "feature_names",
|
||||
]
|
||||
|
||||
@@ -2,12 +2,56 @@
|
||||
Context assembler — converts raw user signals into a PromptContext for LLM tip generation.
|
||||
|
||||
Usage:
|
||||
from ml.features.context import build_context
|
||||
from ml.features.context import build_context, CONTEXT_FEATURES
|
||||
ctx = build_context(tasks, hour_of_day=9, day_of_week=2)
|
||||
|
||||
Feature-spec (issue #61):
|
||||
All context features are JIT — they are assembled at request time from live
|
||||
sources (system clock, caller-supplied task list) rather than read from a
|
||||
cached profile store. They carry no TTL because they are never persisted.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Literal
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class ContextFeatureSpec:
|
||||
name: str
|
||||
dtype: Literal["numeric", "categorical", "list"]
|
||||
freshness: Literal["jit", "batched"]
|
||||
source: str
|
||||
fallback: str
|
||||
description: str
|
||||
|
||||
|
||||
CONTEXT_FEATURES: tuple[ContextFeatureSpec, ...] = (
|
||||
ContextFeatureSpec(
|
||||
name="hour_of_day",
|
||||
dtype="numeric",
|
||||
freshness="jit",
|
||||
source="request",
|
||||
fallback="12",
|
||||
description="Current hour (0–23), supplied by the caller at score time.",
|
||||
),
|
||||
ContextFeatureSpec(
|
||||
name="day_of_week",
|
||||
dtype="numeric",
|
||||
freshness="jit",
|
||||
source="request",
|
||||
fallback="0",
|
||||
description="Weekday index (0=Monday … 6=Sunday), supplied by the caller at score time.",
|
||||
),
|
||||
ContextFeatureSpec(
|
||||
name="tasks",
|
||||
dtype="list",
|
||||
freshness="jit",
|
||||
source="todoist-integration",
|
||||
fallback="[]",
|
||||
description="User's open tasks fetched live from the Todoist integration at request time.",
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
|
||||
@@ -8,6 +8,12 @@ code (ml/serving, eval harnesses, notebooks) knows what fields to expect on
|
||||
|
||||
Update this file whenever you add or rename a feature in the TS registry.
|
||||
The accompanying test asserts the two stay in sync at the name level.
|
||||
|
||||
Feature-spec fields (issue #61):
|
||||
freshness — "batched": value cached in profile store, recomputed on TTL/event.
|
||||
ttl_sec — cache lifetime in seconds; mirrors ``ttlSec`` in registry.ts.
|
||||
source — where the value originates.
|
||||
fallback — raw value returned when the feature is unavailable (null stored).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
@@ -16,6 +22,10 @@ from typing import Literal
|
||||
|
||||
|
||||
Dtype = Literal["numeric", "categorical"]
|
||||
Freshness = Literal["jit", "batched"]
|
||||
|
||||
_HOUR = 3600
|
||||
_DAY = 86_400
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
@@ -23,28 +33,57 @@ class ProfileFeature:
|
||||
name: str
|
||||
dtype: Dtype
|
||||
description: str
|
||||
freshness: Freshness
|
||||
ttl_sec: int
|
||||
source: str
|
||||
fallback: str
|
||||
|
||||
|
||||
PROFILE_FEATURES: tuple[ProfileFeature, ...] = (
|
||||
ProfileFeature(
|
||||
"completion_rate_30d", "numeric",
|
||||
'Fraction of tips served in the last 30 days that received a "done" reaction.',
|
||||
name="completion_rate_30d",
|
||||
dtype="numeric",
|
||||
description='Fraction of tips served in the last 30 days that received a "done" reaction.',
|
||||
freshness="batched",
|
||||
ttl_sec=6 * _HOUR,
|
||||
source="profile_store",
|
||||
fallback="0.0",
|
||||
),
|
||||
ProfileFeature(
|
||||
"dismiss_rate_30d", "numeric",
|
||||
'Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
|
||||
name="dismiss_rate_30d",
|
||||
dtype="numeric",
|
||||
description='Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
|
||||
freshness="batched",
|
||||
ttl_sec=6 * _HOUR,
|
||||
source="profile_store",
|
||||
fallback="0.0",
|
||||
),
|
||||
ProfileFeature(
|
||||
"mean_dwell_ms_30d", "numeric",
|
||||
"Average dwell time (ms between served and reacted) over the last 30 days.",
|
||||
name="mean_dwell_ms_30d",
|
||||
dtype="numeric",
|
||||
description="Average dwell time (ms between served and reacted) over the last 30 days.",
|
||||
freshness="batched",
|
||||
ttl_sec=6 * _HOUR,
|
||||
source="profile_store",
|
||||
fallback="null — serving normalises to 0.0",
|
||||
),
|
||||
ProfileFeature(
|
||||
"preferred_hour", "numeric",
|
||||
'Hour-of-day with the most "done" reactions in the last 30 days (0-23).',
|
||||
name="preferred_hour",
|
||||
dtype="numeric",
|
||||
description='Hour-of-day with the most "done" reactions in the last 30 days (0–23).',
|
||||
freshness="batched",
|
||||
ttl_sec=_DAY,
|
||||
source="profile_store",
|
||||
fallback="null — serving normalises to 0.5 (neutral alignment)",
|
||||
),
|
||||
ProfileFeature(
|
||||
"tip_volume_30d", "numeric",
|
||||
"Number of tips served to the user in the last 30 days.",
|
||||
name="tip_volume_30d",
|
||||
dtype="numeric",
|
||||
description="Number of tips served to the user in the last 30 days.",
|
||||
freshness="batched",
|
||||
ttl_sec=_HOUR,
|
||||
source="profile_store",
|
||||
fallback="0",
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
"""Tests for ml/features/context.py"""
|
||||
import pytest
|
||||
import sys, os; sys.path.insert(0, os.path.dirname(__file__))
|
||||
from context import build_context, TaskSignal, PromptContext
|
||||
from context import build_context, TaskSignal, PromptContext, CONTEXT_FEATURES
|
||||
|
||||
|
||||
def test_empty_tasks():
|
||||
@@ -62,3 +62,30 @@ def test_due_date_none_preserved():
|
||||
tasks = [TaskSignal(id="x", content="No due", due_date=None)]
|
||||
ctx = build_context(tasks)
|
||||
assert ctx.tasks[0]["due_date"] is None
|
||||
|
||||
|
||||
# ── CONTEXT_FEATURES spec tests (issue #61) ──────────────────────────────────
|
||||
|
||||
def test_context_features_expected_names():
|
||||
names = {f.name for f in CONTEXT_FEATURES}
|
||||
assert names == {"hour_of_day", "day_of_week", "tasks"}
|
||||
|
||||
|
||||
def test_context_features_all_jit():
|
||||
for f in CONTEXT_FEATURES:
|
||||
assert f.freshness == "jit", f"{f.name}: expected freshness='jit', got {f.freshness!r}"
|
||||
|
||||
|
||||
def test_context_features_source_set():
|
||||
for f in CONTEXT_FEATURES:
|
||||
assert f.source, f"{f.name}: source must not be empty"
|
||||
|
||||
|
||||
def test_context_features_fallback_set():
|
||||
for f in CONTEXT_FEATURES:
|
||||
assert f.fallback, f"{f.name}: fallback must not be empty"
|
||||
|
||||
|
||||
def test_context_features_no_duplicates():
|
||||
names = [f.name for f in CONTEXT_FEATURES]
|
||||
assert len(names) == len(set(names)), f"duplicate names: {names}"
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
"""Smoke test for profile_schema mirror (#81 phase A).
|
||||
"""Smoke test for profile_schema mirror (#81 phase A, #61 freshness spec).
|
||||
|
||||
The TS registry in services/api/src/profile/registry.ts is the source of truth.
|
||||
This test checks the names listed here match the registry by reading the TS
|
||||
@@ -14,6 +14,18 @@ from ml.features.profile_schema import PROFILE_FEATURES, feature_names
|
||||
|
||||
REGISTRY_PATH = Path(__file__).resolve().parents[2] / "services" / "api" / "src" / "profile" / "registry.ts"
|
||||
|
||||
_HOUR = 3600
|
||||
_DAY = 86_400
|
||||
|
||||
# Expected ttl_sec values mirrored from registry.ts — keeps the two in sync.
|
||||
_EXPECTED_TTL: dict[str, int] = {
|
||||
"completion_rate_30d": 6 * _HOUR,
|
||||
"dismiss_rate_30d": 6 * _HOUR,
|
||||
"mean_dwell_ms_30d": 6 * _HOUR,
|
||||
"preferred_hour": _DAY,
|
||||
"tip_volume_30d": _HOUR,
|
||||
}
|
||||
|
||||
|
||||
def _ts_registry_names() -> set[str]:
|
||||
text = REGISTRY_PATH.read_text(encoding="utf-8")
|
||||
@@ -21,6 +33,35 @@ def _ts_registry_names() -> set[str]:
|
||||
return set(re.findall(r"name:\s*'([a-zA-Z0-9_]+)'", text))
|
||||
|
||||
|
||||
def _ts_registry_ttls() -> dict[str, int]:
|
||||
"""Parse ttlSec values from registry.ts (crude but sufficient for drift detection).
|
||||
|
||||
Handles TS symbolic constants (HOUR, DAY) and expressions like ``6 * HOUR``.
|
||||
"""
|
||||
text = REGISTRY_PATH.read_text(encoding="utf-8")
|
||||
|
||||
# Extract numeric constants: `const HOUR = 3600;` or `const DAY = 86_400;`
|
||||
consts: dict[str, int] = {}
|
||||
for m in re.finditer(r"const\s+([A-Z_]+)\s*=\s*([\d_]+)", text):
|
||||
consts[m.group(1)] = int(m.group(2).replace("_", ""))
|
||||
|
||||
def _eval_expr(expr: str) -> int:
|
||||
tokens = [t.strip() for t in expr.split("*")]
|
||||
result = 1
|
||||
for t in tokens:
|
||||
result *= consts[t] if t in consts else int(t)
|
||||
return result
|
||||
|
||||
result: dict[str, int] = {}
|
||||
for block in re.split(r"\{", text):
|
||||
name_m = re.search(r"name:\s*'([a-zA-Z0-9_]+)'", block)
|
||||
# ttlSec may be a constant name, a number, or `N * CONST`
|
||||
ttl_m = re.search(r"ttlSec:\s*([A-Za-z0-9_]+(?:\s*\*\s*[A-Za-z0-9_]+)?)", block)
|
||||
if name_m and ttl_m:
|
||||
result[name_m.group(1)] = _eval_expr(ttl_m.group(1))
|
||||
return result
|
||||
|
||||
|
||||
def test_python_mirror_matches_ts_registry():
|
||||
py_names = feature_names()
|
||||
ts_names = _ts_registry_names()
|
||||
@@ -39,3 +80,34 @@ def test_profile_schema_no_duplicates():
|
||||
def test_profile_schema_dtypes_known():
|
||||
for f in PROFILE_FEATURES:
|
||||
assert f.dtype in {"numeric", "categorical"}
|
||||
|
||||
|
||||
def test_all_profile_features_are_batched():
|
||||
for f in PROFILE_FEATURES:
|
||||
assert f.freshness == "batched", f"{f.name}: expected freshness='batched', got {f.freshness!r}"
|
||||
|
||||
|
||||
def test_profile_feature_ttl_matches_ts_registry():
|
||||
ts_ttls = _ts_registry_ttls()
|
||||
for f in PROFILE_FEATURES:
|
||||
assert f.name in ts_ttls, f"{f.name} not found in TS registry ttlSec parse"
|
||||
assert f.ttl_sec == ts_ttls[f.name], (
|
||||
f"{f.name}: Python ttl_sec={f.ttl_sec} != TS ttlSec={ts_ttls[f.name]}"
|
||||
)
|
||||
|
||||
|
||||
def test_profile_feature_ttl_matches_expected():
|
||||
for f in PROFILE_FEATURES:
|
||||
assert f.ttl_sec == _EXPECTED_TTL[f.name], (
|
||||
f"{f.name}: ttl_sec={f.ttl_sec}, expected {_EXPECTED_TTL[f.name]}"
|
||||
)
|
||||
|
||||
|
||||
def test_profile_feature_source_is_profile_store():
|
||||
for f in PROFILE_FEATURES:
|
||||
assert f.source == "profile_store", f"{f.name}: unexpected source {f.source!r}"
|
||||
|
||||
|
||||
def test_profile_feature_fallback_set():
|
||||
for f in PROFILE_FEATURES:
|
||||
assert f.fallback, f"{f.name}: fallback must not be empty"
|
||||
|
||||
124
ml/pipelines/sim_dag.py
Normal file
@@ -0,0 +1,124 @@
|
||||
"""
|
||||
Airflow DAG: bandit_sim
|
||||
|
||||
Runs a bandit policy simulation and logs results to MLflow.
|
||||
Triggered on-demand from the oO admin panel or manually from the Airflow UI.
|
||||
|
||||
Required conf keys (passed via dag_run.conf):
|
||||
sim_run_id str — oO SQLite run ID for callback correlation
|
||||
n_users int — number of synthetic users
|
||||
n_rounds int — rounds per user
|
||||
tasks_per_round int — candidate pool size per round
|
||||
policies list — policy names to compare
|
||||
judge_mode str — "rule" | "llm"
|
||||
ml_url str — ml/serving URL (e.g. http://ml-serving:8000)
|
||||
mlflow_url str — MLflow tracking URI (e.g. http://mlflow:5000/mlflow)
|
||||
callback_url str — oO API callback endpoint
|
||||
internal_token str — x-internal-token header value
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
from airflow import DAG
|
||||
from airflow.operators.python import PythonOperator
|
||||
|
||||
|
||||
def _run_sim(**context: object) -> dict:
|
||||
conf: dict = context["dag_run"].conf or {}
|
||||
|
||||
n_users = int(conf.get("n_users", 5))
|
||||
n_rounds = int(conf.get("n_rounds", 20))
|
||||
tasks_per_round = int(conf.get("tasks_per_round", 8))
|
||||
policies = list(conf.get("policies", ["linucb-v1", "egreedy-v1"]))
|
||||
judge_mode = str(conf.get("judge_mode", "rule"))
|
||||
ml_url = str(conf.get("ml_url", "http://ml-serving:8000"))
|
||||
mlflow_url = str(conf.get("mlflow_url", os.environ.get("MLFLOW_TRACKING_URI", "")))
|
||||
mlflow_experiment = "bandit_simulation"
|
||||
|
||||
sys.path.insert(0, "/opt/airflow/ml/experiments/sim")
|
||||
from runner import run_simulation # type: ignore[import]
|
||||
|
||||
use_llm = judge_mode == "llm"
|
||||
result = run_simulation(
|
||||
n_users=n_users,
|
||||
n_rounds=n_rounds,
|
||||
tasks_per_round=tasks_per_round,
|
||||
ml_url=ml_url,
|
||||
policies=policies,
|
||||
use_llm=use_llm,
|
||||
seed=42,
|
||||
mlflow_url=mlflow_url or None,
|
||||
mlflow_experiment=mlflow_experiment,
|
||||
)
|
||||
return result
|
||||
|
||||
|
||||
def _callback(**context: object) -> None:
|
||||
import httpx
|
||||
|
||||
conf: dict = context["dag_run"].conf or {}
|
||||
callback_url: str = str(conf.get("callback_url", ""))
|
||||
internal_token: str = str(conf.get("internal_token", ""))
|
||||
|
||||
if not callback_url or not internal_token:
|
||||
print("No callback_url or internal_token — skipping result push.", flush=True)
|
||||
return
|
||||
|
||||
result: dict = context["ti"].xcom_pull(task_ids="run_sim")
|
||||
if not result:
|
||||
print("No result from run_sim task — callback skipped.", flush=True)
|
||||
return
|
||||
|
||||
payload = {
|
||||
"summary": result.get("summary", {}),
|
||||
"winner": result.get("winner", ""),
|
||||
"persona_breakdown": result.get("persona_breakdown", {}),
|
||||
"events": result.get("events", []),
|
||||
"mlflow_run_id": result.get("mlflow_run_id"),
|
||||
}
|
||||
|
||||
try:
|
||||
r = httpx.post(
|
||||
callback_url,
|
||||
json=payload,
|
||||
headers={"x-internal-token": internal_token},
|
||||
timeout=30.0,
|
||||
)
|
||||
r.raise_for_status()
|
||||
print(f"Callback OK: {r.status_code}", flush=True)
|
||||
except Exception as exc:
|
||||
print(f"Callback failed: {exc}", flush=True)
|
||||
raise
|
||||
|
||||
|
||||
with DAG(
|
||||
dag_id="bandit_sim",
|
||||
description="On-demand bandit policy simulation with MLflow tracking",
|
||||
schedule_interval=None,
|
||||
start_date=datetime(2025, 1, 1),
|
||||
catchup=False,
|
||||
tags=["bandit", "simulation", "ml"],
|
||||
default_args={
|
||||
"retries": 1,
|
||||
"retry_delay": timedelta(minutes=2),
|
||||
},
|
||||
) as dag:
|
||||
|
||||
run_sim = PythonOperator(
|
||||
task_id="run_sim",
|
||||
python_callable=_run_sim,
|
||||
provide_context=True,
|
||||
)
|
||||
|
||||
push_results = PythonOperator(
|
||||
task_id="push_results",
|
||||
python_callable=_callback,
|
||||
provide_context=True,
|
||||
)
|
||||
|
||||
run_sim >> push_results
|
||||
104
ml/serving/README.md
Normal file
@@ -0,0 +1,104 @@
|
||||
# ml/serving
|
||||
|
||||
FastAPI online scorer, tip generator, and JetStream consumer.
|
||||
|
||||
## Contract
|
||||
|
||||
| Endpoint | Description |
|----------|-------------|
| `POST /score` | LinUCB d=5 (baseline, shadow-eligible) |
| `POST /score/egreedy` | ε-greedy v1, d=7 (active policy — ADR-0007) |
| `POST /score/egreedy/v2` | ε-greedy v2, d=12 + profile features (shadow — ADR-0012) |
| `POST /reward` / `/reward/egreedy` / `/reward/egreedy/v2` | Online reward update per policy |
| `POST /generate` | LLM tip candidates via LiteLLM `tip-generator` alias |
| `GET /stats/{user_id}` / `/stats/egreedy/{user_id}` / `/stats/egreedy/v2/{user_id}` | Per-user policy stats |
| `GET /features/{user_id}` | Last 100 scored feature vectors (ring buffer) |
| `POST /reset/{user_id}` | Clear all per-user bandit state (admin) |
| `GET /health` | `{ ok, nats: { enabled, consumers: { signals, feedback } } }` |
|
||||
|
||||
Called by `services/api/src/recommender/` over HTTP. Contract is stable across policy swaps.
|
||||
|
||||
## Feature dimensions
|
||||
|
||||
| Policy | d | Extra dims vs previous |
|--------|---|------------------------|
| LinUCB v1 | 5 | hour_sin/cos, is_overdue, task_age, priority |
| ε-greedy v1 | 7 | + dow_sin/cos |
| ε-greedy v2 | 12 | + 5 profile features (ADR-0012) |
|
||||
|
||||
Profile features are computed by the TypeScript API and shipped on each `/score` call as `profile_features`. See `ml/README.md` and ADR-0011.
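
As a rough illustration of the time dimensions listed above, the sketch below shows one common way to build `hour_sin/cos` and `dow_sin/cos`: project the periodic value onto the unit circle so that 23:00 and 00:00 end up close together. The helper names `cyclical` and `time_features` are assumptions made for this example; the encoding actually used in `main.py` may differ.

```python
import math


def cyclical(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def time_features(hour_of_day: int, day_of_week: int) -> list[float]:
    """Hypothetical helper: the four time dims used by the d=7 policy."""
    hour_sin, hour_cos = cyclical(hour_of_day, 24)
    dow_sin, dow_cos = cyclical(day_of_week, 7)
    return [hour_sin, hour_cos, dow_sin, dow_cos]
```

Encoding each periodic field as two values costs one extra dimension per field, but it avoids the artificial jump between hour 23 and hour 0 that a raw integer would introduce.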
|
||||
|
||||
## JetStream consumers
|
||||
|
||||
On startup, `nats_consumer.py` registers two durable push consumers against NATS JetStream:
|
||||
|
||||
| Consumer | Stream | Subjects | Durable name |
|----------|--------|----------|--------------|
| signals | `signals` | `signals.>` | `feature-pipeline-signals` |
| feedback | `feedback` | `feedback.>` | `feature-pipeline-feedback` |
|
||||
|
||||
**Handled subjects:**
|
||||
- `signals.task.synced` — writes `{last_sync_ts, task_count}` to `{STATE_DIR}/{user}_sync.json`
- `signals.tip.feedback` — logged for observability; reward update happens via the HTTP path in the recommender
|
||||
|
||||
**Payload validation:** each message is validated against the pydantic models in `schemas.py` (mirroring `packages/shared-types/events/oo/events/v1/`). A `ValidationError` triggers a nak so the message is redelivered rather than silently dropped.
|
||||
|
||||
**Ack semantics:** explicit ack on success; nak for redelivery on error; dead-lettered after `NATS_MAX_DELIVER` attempts.
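
The validate-then-ack flow described above can be sketched without a broker. This is a minimal illustration, not the code in `nats_consumer.py`: `FakeMsg` stands in for a JetStream message, `parse_task_synced` is a hypothetical stdlib validator used in place of the pydantic models in `schemas.py`, and the `user_id` / `task_count` field names are assumptions.

```python
import asyncio
import json
from typing import Optional


class FakeMsg:
    """Stand-in for a JetStream message; records which reply was sent."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.outcome: Optional[str] = None

    async def ack(self) -> None:
        self.outcome = "ack"

    async def nak(self) -> None:
        self.outcome = "nak"


def parse_task_synced(raw: bytes) -> dict:
    """Hypothetical validator: require user_id (str) and task_count (int)."""
    payload = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    if not isinstance(payload.get("user_id"), str):
        raise ValueError("user_id must be a string")
    if not isinstance(payload.get("task_count"), int):
        raise ValueError("task_count must be an integer")
    return payload


async def handle(msg: FakeMsg) -> None:
    """Ack on success; nak on a validation error so the broker redelivers."""
    try:
        parse_task_synced(msg.data)
    except ValueError:
        await msg.nak()
        return
    await msg.ack()
```

Redelivery is bounded on the broker side by the consumer's max-deliver setting, so a permanently malformed message is retried only a limited number of times.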
|
||||
|
||||
**Disabled** when `NATS_URL` is unset (default in local dev without NATS). No import of `nats-py` occurs in that case.
|
||||
|
||||
## Observability
|
||||
|
||||
Logs are structured JSON via **structlog**. Every line includes `level`, `logger`, `timestamp`, and — when a W3C `traceparent` header is present on the incoming request — `trace_id` bound via Python `contextvars`, so all log lines within a request carry the same trace ID as the upstream API call.
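
For reference, a W3C `traceparent` value has four dash-separated fields (`version-traceid-parentid-flags`), and the trace ID is the 32-character second field. The sketch below mirrors that extraction as a standalone function; the name `extract_trace_id` is an assumption for this example, and it checks only the field count and length, not that the ID is valid hex.

```python
from typing import Optional


def extract_trace_id(traceparent: str) -> Optional[str]:
    """Return the 32-char trace ID from a traceparent header, or None."""
    parts = traceparent.split("-")
    if len(parts) == 4 and len(parts[1]) == 32:
        return parts[1]
    return None
```

A malformed or missing header yields `None`, in which case no `trace_id` key is bound and the log line is emitted without it.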
|
||||
|
||||
Sentry error capture is active when `SENTRY_DSN` is set.
|
||||
|
||||
## Config
|
||||
|
||||
| Env var | Default | Description |
|---------|---------|-------------|
| `STATE_DIR` | `/tmp/oo-bandit-state` | Directory for per-user bandit state JSON files |
| `LITELLM_URL` | `http://localhost:4000` | LiteLLM gateway |
| `LITELLM_MASTER_KEY` | `sk-oo-dev` | LiteLLM auth key |
| `NATS_URL` | *(empty)* | NATS broker URL; empty = consumers disabled |
| `NATS_DURABLE_PREFIX` | `feature-pipeline` | Prefix for durable consumer names |
| `NATS_MAX_DELIVER` | `5` | Max redelivery attempts before dropping |
| `DEFAULT_PROMPT_VERSION` | `v1` | Fallback prompt version for `/generate` |
| `ENV` | `development` | Environment label (passed to Sentry) |
| `SENTRY_DSN` | *(empty)* | Sentry DSN; empty = Sentry disabled |
|
||||
|
||||
## Health story
|
||||
|
||||
`GET /health` returns `{ ok: true }` plus NATS consumer state:
|
||||
|
||||
```json
{
  "ok": true,
  "nats": {
    "enabled": true,
    "consumers": {
      "signals": { "last_msg_ts": "2026-04-25T10:00:00Z", "processed": 42, "errors": 0 },
      "feedback": { "last_msg_ts": null, "processed": 0, "errors": 0 }
    }
  }
}
```
|
||||
|
||||
`last_msg_ts` is `null` until the first message arrives. Used by docker-compose healthcheck.
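
A caller that wants more than the top-level `ok` flag could inspect the consumer block. The sketch below is one possible reading of the payload shape shown above; `idle_consumers` is a hypothetical helper, not part of the service.

```python
def idle_consumers(health: dict) -> list[str]:
    """Return names of consumers that have not yet received any message."""
    nats = health.get("nats", {})
    if not nats.get("enabled"):
        # Consumers are disabled when NATS_URL is unset, so nothing is idle.
        return []
    consumers = nats.get("consumers", {})
    return sorted(
        name for name, state in consumers.items()
        if state.get("last_msg_ts") is None
    )
```

Applied to the example response above, this reports only the `feedback` consumer, since `signals` already has a timestamp.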
|
||||
|
||||
## Extraction criteria
|
||||
|
||||
Extract to its own process (already is one). Extract to a dedicated host / GPU node when:
|
||||
- p99 scoring latency exceeds 50 ms under load, **or**
- model weights are too large to share memory with the Python process on the current host.
|
||||
|
||||
## State
|
||||
|
||||
Per-user bandit state is stored as JSON files in `STATE_DIR`:
|
||||
|
||||
| File pattern | Policy |
|---|---|
| `{user}.json` | LinUCB v1 |
| `{user}_egreedy.json` | ε-greedy v1 |
| `{user}_egreedy_v2.json` | ε-greedy v2 |
| `{user}_sync.json` | Last task sync metadata (written by JetStream consumer) |
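
The file patterns above can be expressed as a small lookup. This is a sketch only: the suffixes come from the table, while the dictionary keys and the `state_path` helper are illustrative names, not identifiers taken from `main.py`.

```python
from pathlib import Path

# Suffixes follow the table above; the keys are illustrative labels.
STATE_SUFFIXES = {
    "linucb-v1": "",
    "egreedy-v1": "_egreedy",
    "egreedy-v2": "_egreedy_v2",
    "sync": "_sync",
}


def state_path(state_dir: str, user: str, kind: str) -> Path:
    """Build the per-user state file path for a policy or the sync record."""
    return Path(state_dir) / f"{user}{STATE_SUFFIXES[kind]}.json"
```

For example, `state_path("/tmp/oo-bandit-state", "u1", "egreedy-v2")` resolves to a file named `u1_egreedy_v2.json` inside the default `STATE_DIR`.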
|
||||
20
ml/serving/logging_config.py
Normal file
@@ -0,0 +1,20 @@
|
||||
"""Structlog JSON configuration — import once at process start."""
|
||||
import logging
|
||||
import structlog
|
||||
|
||||
|
||||
def configure() -> None:
|
||||
structlog.configure(
|
||||
processors=[
|
||||
structlog.contextvars.merge_contextvars,
|
||||
structlog.stdlib.add_log_level,
|
||||
structlog.stdlib.add_logger_name,
|
||||
structlog.processors.TimeStamper(fmt="iso"),
|
||||
structlog.processors.StackInfoRenderer(),
|
||||
structlog.processors.JSONRenderer(),
|
||||
],
|
||||
wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
|
||||
context_class=dict,
|
||||
logger_factory=structlog.PrintLoggerFactory(),
|
||||
)
|
||||
logging.basicConfig(level=logging.WARNING)
|
||||
@@ -28,17 +28,55 @@ import math
|
||||
import os
|
||||
import time
|
||||
from collections import deque
|
||||
from contextlib import asynccontextmanager
|
||||
from pathlib import Path
|
||||
from typing import Optional, Deque
|
||||
|
||||
import httpx
|
||||
import numpy as np
|
||||
from fastapi import FastAPI, HTTPException
|
||||
import sentry_sdk
|
||||
import structlog
|
||||
import structlog.contextvars
|
||||
from fastapi import FastAPI, HTTPException, Request
|
||||
from pydantic import BaseModel
|
||||
from starlette.middleware.base import BaseHTTPMiddleware
|
||||
|
||||
import logging_config
|
||||
import nats_consumer
|
||||
from prompts import get_prompt
|
||||
|
||||
app = FastAPI(title="oO ML Serving", version="1.0.0")
|
||||
logging_config.configure()
|
||||
|
||||
_SENTRY_DSN = os.getenv("SENTRY_DSN")
|
||||
if _SENTRY_DSN:
|
||||
sentry_sdk.init(dsn=_SENTRY_DSN, environment=os.getenv("ENV", "development"))
|
||||
|
||||
log = structlog.get_logger()
|
||||
|
||||
|
||||
@asynccontextmanager
|
||||
async def lifespan(app: FastAPI):
|
||||
await nats_consumer.start(STATE_DIR)
|
||||
yield
|
||||
await nats_consumer.stop()
|
||||
|
||||
|
||||
app = FastAPI(title="oO ML Serving", version="1.0.0", lifespan=lifespan)
|
||||
|
||||
|
||||
class _TracingMiddleware(BaseHTTPMiddleware):
|
||||
async def dispatch(self, request: Request, call_next):
|
||||
structlog.contextvars.clear_contextvars()
|
||||
traceparent = request.headers.get("traceparent", "")
|
||||
if traceparent:
|
||||
parts = traceparent.split("-")
|
||||
trace_id = parts[1] if len(parts) == 4 and len(parts[1]) == 32 else None
|
||||
if trace_id:
|
||||
structlog.contextvars.bind_contextvars(trace_id=trace_id)
|
||||
return await call_next(request)
|
||||
|
||||
|
||||
app.add_middleware(_TracingMiddleware)
|
||||
|
||||
LITELLM_URL = os.getenv("LITELLM_URL", "http://localhost:4000")
|
||||
LITELLM_MASTER_KEY = os.getenv("LITELLM_MASTER_KEY", "sk-oo-dev")
|
||||
@@ -315,7 +353,13 @@ class GenerateResponse(BaseModel):
|
||||
|
||||
@app.get("/health")
|
||||
def health():
|
||||
return {"ok": True}
|
||||
return {
|
||||
"ok": True,
|
||||
"nats": {
|
||||
"enabled": bool(nats_consumer.NATS_URL),
|
||||
"consumers": nats_consumer.consumer_health,
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
_RETRY_SUFFIX = (
|
||||
|
||||
146
ml/serving/nats_consumer.py
Normal file
@@ -0,0 +1,146 @@
|
||||
"""
|
||||
JetStream durable consumers for ml/serving.
|
||||
|
||||
Streams:
|
||||
signals (subjects: signals.>) — durable: {prefix}-signals
|
||||
feedback (subjects: feedback.>) — durable: {prefix}-feedback
|
||||
|
||||
Handled subjects:
|
||||
signals.task.synced → write per-user sync metadata to STATE_DIR
|
||||
signals.tip.feedback → log for observability (reward is applied via HTTP path)
|
||||
|
||||
Config (env vars):
|
||||
NATS_URL — broker URL; empty = consumers disabled (default: "")
|
||||
NATS_DURABLE_PREFIX — prefix for durable consumer names (default: "feature-pipeline")
|
||||
NATS_MAX_DELIVER — max redelivery attempts before dropping (default: 5)
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import structlog
|
||||
from schemas import TaskSyncedPayload, TipFeedbackPayload
|
||||
|
||||
log = structlog.get_logger(__name__)
|
||||
|
||||
NATS_URL = os.getenv("NATS_URL", "")
|
||||
NATS_DURABLE_PREFIX = os.getenv("NATS_DURABLE_PREFIX", "feature-pipeline")
|
||||
NATS_MAX_DELIVER = int(os.getenv("NATS_MAX_DELIVER", "5"))
|
||||
|
||||
# Exposed to /health
|
||||
consumer_health: dict[str, dict] = {
|
||||
"signals": {"last_msg_ts": None, "processed": 0, "errors": 0},
|
||||
"feedback": {"last_msg_ts": None, "processed": 0, "errors": 0},
|
||||
}
|
||||
|
||||
_nc = None # nats.aio.Client
|
||||
_subs: list = [] # active JetStream subscriptions
|
||||
|
||||
|
||||
# ── Subject handlers ───────────────────────────────────────────────────────
|
||||
|
||||
def _sync_meta_path(state_dir: Path, user_id: str) -> Path:
|
||||
safe = "".join(c if c.isalnum() else "_" for c in user_id)
|
||||
return state_dir / f"{safe}_sync.json"
|
||||
|
||||
|
||||
async def _handle(subject: str, payload: dict, state_dir: Path) -> None:
|
||||
if subject == "signals.task.synced":
|
||||
msg = TaskSyncedPayload.model_validate(payload)
|
||||
p = _sync_meta_path(state_dir, msg.userId)
|
||||
p.write_text(json.dumps({
|
||||
"last_sync_ts": msg.syncedAt,
|
||||
"task_count": msg.count,
|
||||
}))
|
||||
log.info("nats: task_synced", user_id=msg.userId, count=msg.count)
|
||||
elif subject == "signals.tip.feedback":
|
||||
msg = TipFeedbackPayload.model_validate(payload)
|
||||
log.info("nats: tip_feedback", user_id=msg.userId, tip_id=msg.tipId, action=msg.action, reward=msg.reward)
|
||||
else:
|
||||
log.debug("nats: unhandled subject", subject=subject)
|
||||
|
||||
|
||||
# ── Consumer factory ───────────────────────────────────────────────────────
|
||||
|
||||
def _make_handler(key: str, state_dir: Path):
|
||||
"""Return an async push-consumer callback that acks on success, naks on error."""
|
||||
async def handler(msg) -> None:
|
||||
consumer_health[key]["last_msg_ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||
try:
|
||||
payload = json.loads(msg.data)
|
||||
await _handle(msg.subject, payload, state_dir)
|
||||
await msg.ack()
|
||||
consumer_health[key]["processed"] += 1
|
||||
except Exception as exc:
|
||||
consumer_health[key]["errors"] += 1
|
||||
log.warning("nats: processing error", key=key, subject=msg.subject, exc=str(exc))
|
||||
await msg.nak()
|
||||
return handler
|
||||
|
||||
|
||||
# ── Lifecycle ──────────────────────────────────────────────────────────────
|
||||
|
||||
async def start(state_dir: Path) -> None:
|
||||
"""Connect to NATS and register durable push consumers. No-op if NATS_URL is unset."""
|
||||
global _nc
|
||||
if not NATS_URL:
|
||||
log.info("nats: NATS_URL unset — JetStream consumers disabled")
|
||||
return
|
||||
|
||||
try:
|
||||
import nats as nats_lib
|
||||
from nats.js.api import ConsumerConfig, AckPolicy
|
||||
|
||||
_nc = await nats_lib.connect(
|
||||
NATS_URL,
|
||||
name="ml-serving",
|
||||
reconnect_time_wait=5,
|
||||
max_reconnect_attempts=-1,
|
||||
)
|
||||
js = _nc.jetstream()
|
||||
log.info("nats: connected", url=NATS_URL)
|
||||
except Exception as exc:
|
||||
log.warning("nats: connection failed — consumers disabled", exc=str(exc))
|
||||
_nc = None
|
||||
return
|
||||
|
||||
config = ConsumerConfig(
|
||||
ack_policy=AckPolicy.EXPLICIT,
|
||||
max_deliver=NATS_MAX_DELIVER,
|
||||
)
|
||||
|
||||
for key, subject in [("signals", "signals.>"), ("feedback", "feedback.>")]:
|
||||
durable = f"{NATS_DURABLE_PREFIX}-{key}"
|
||||
try:
|
||||
sub = await js.subscribe(
|
||||
subject,
|
||||
durable=durable,
|
||||
cb=_make_handler(key, state_dir),
|
||||
config=config,
|
||||
)
|
||||
_subs.append(sub)
|
||||
log.info("nats: subscribed", subject=subject, durable=durable)
|
||||
except Exception as exc:
|
||||
log.warning("nats: subscribe failed", key=key, exc=str(exc))
|
||||
|
||||
|
||||
async def stop() -> None:
|
||||
"""Drain subscriptions and close NATS connection."""
|
||||
global _nc
|
||||
for sub in _subs:
|
||||
try:
|
||||
await sub.unsubscribe()
|
||||
except Exception:
|
||||
pass
|
||||
_subs.clear()
|
||||
if _nc:
|
||||
try:
|
||||
await _nc.drain()
|
||||
except Exception:
|
||||
pass
|
||||
_nc = None
|
||||
log.info("nats: disconnected")
|
||||
@@ -4,3 +4,6 @@ pydantic==2.10.4
|
||||
numpy>=1.26.0
|
||||
httpx>=0.27.0
|
||||
anthropic>=0.40.0
|
||||
nats-py>=2.9.0
|
||||
structlog>=24.1.0
|
||||
sentry-sdk>=2.0.0
|
||||
|
||||
50
ml/serving/schemas.py
Normal file
@@ -0,0 +1,50 @@
|
||||
"""
|
||||
Pydantic models mirroring oo.events.v1 proto schemas.
|
||||
|
||||
Field names use camelCase to match the proto3 JSON mapping convention
|
||||
and the TypeScript payload shapes published by services/api.
|
||||
|
||||
Keep in sync with packages/shared-types/events/oo/events/v1/.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Literal, Optional
|
||||
from pydantic import BaseModel
|
||||
|
||||
|
||||
class TaskSyncedPayload(BaseModel):
|
||||
userId: str
|
||||
source: str
|
||||
count: int
|
||||
syncedAt: str
|
||||
|
||||
|
||||
class TipServedPayload(BaseModel):
|
||||
userId: str
|
||||
tipId: str
|
||||
policy: str
|
||||
servedAt: str
|
||||
|
||||
|
||||
class TipFeedbackPayload(BaseModel):
|
||||
userId: str
|
||||
tipId: str
|
||||
action: Literal['done', 'dismiss', 'snooze', 'helpful', 'not_helpful']
|
||||
reward: float
|
||||
dwellMs: Optional[int] = None
|
||||
createdAt: str
|
||||
|
||||
|
||||
class TipRewardFailedPayload(BaseModel):
|
||||
userId: str
|
||||
tipId: str
|
||||
reward: float
|
||||
attempts: int
|
||||
error: str
|
||||
failedAt: str
|
||||
|
||||
|
||||
class IntegrationTokenExpiredPayload(BaseModel):
|
||||
userId: str
|
||||
provider: str
|
||||
detectedAt: str
|
||||
169
ml/serving/tests/test_schemas_and_consumer.py
Normal file
@@ -0,0 +1,169 @@
|
||||
"""
|
||||
Tests for schemas.py and for nats_consumer (_handle and _make_handler).
|
||||
"""
|
||||
import json
|
||||
import pytest
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from pydantic import ValidationError
|
||||
from unittest.mock import AsyncMock
|
||||
|
||||
from schemas import (
|
||||
TaskSyncedPayload,
|
||||
TipServedPayload,
|
||||
TipFeedbackPayload,
|
||||
TipRewardFailedPayload,
|
||||
IntegrationTokenExpiredPayload,
|
||||
)
|
||||
from nats_consumer import _handle, _sync_meta_path
|
||||
|
||||
|
||||
# ── Schema validation ─────────────────────────────────────────────────────────
|
||||
|
||||
class TestTaskSyncedPayload:
|
||||
def test_valid(self):
|
||||
p = TaskSyncedPayload.model_validate(
|
||||
{"userId": "u1", "source": "todoist", "count": 5, "syncedAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
assert p.userId == "u1"
|
||||
assert p.count == 5
|
||||
|
||||
def test_missing_field_raises(self):
|
||||
with pytest.raises(ValidationError):
|
||||
TaskSyncedPayload.model_validate({"userId": "u1", "source": "todoist"})
|
||||
|
||||
def test_wrong_type_raises(self):
|
||||
with pytest.raises(ValidationError):
|
||||
TaskSyncedPayload.model_validate(
|
||||
{"userId": "u1", "source": "todoist", "count": "not-an-int", "syncedAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
|
||||
|
||||
class TestTipFeedbackPayload:
|
||||
def test_valid_without_dwell(self):
|
||||
p = TipFeedbackPayload.model_validate(
|
||||
{"userId": "u1", "tipId": "t1", "action": "done", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
assert p.dwellMs is None
|
||||
|
||||
def test_valid_with_dwell(self):
|
||||
p = TipFeedbackPayload.model_validate(
|
||||
{"userId": "u1", "tipId": "t1", "action": "helpful", "reward": 0.5,
|
||||
"dwellMs": 3200, "createdAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
assert p.dwellMs == 3200
|
||||
|
||||
def test_invalid_action_raises(self):
|
||||
with pytest.raises(ValidationError):
|
||||
TipFeedbackPayload.model_validate(
|
||||
{"userId": "u1", "tipId": "t1", "action": "like", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
|
||||
def test_all_valid_actions(self):
|
||||
for action in ("done", "dismiss", "snooze", "helpful", "not_helpful"):
|
||||
p = TipFeedbackPayload.model_validate(
|
||||
{"userId": "u1", "tipId": "t1", "action": action, "reward": 0.0, "createdAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
assert p.action == action
|
||||
|
||||
|
||||
class TestOtherPayloads:
|
||||
def test_tip_served(self):
|
||||
p = TipServedPayload.model_validate(
|
||||
{"userId": "u1", "tipId": "t1", "policy": "egreedy-v2", "servedAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
assert p.policy == "egreedy-v2"
|
||||
|
||||
def test_tip_reward_failed(self):
|
||||
p = TipRewardFailedPayload.model_validate(
|
||||
{"userId": "u1", "tipId": "t1", "reward": 1.0, "attempts": 3,
|
||||
"error": "timeout", "failedAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
assert p.attempts == 3
|
||||
|
||||
def test_integration_token_expired(self):
|
||||
p = IntegrationTokenExpiredPayload.model_validate(
|
||||
{"userId": "u1", "provider": "todoist", "detectedAt": "2026-04-25T10:00:00Z"}
|
||||
)
|
||||
assert p.provider == "todoist"
|
||||
|
||||
|
||||
# ── _handle behaviour ─────────────────────────────────────────────────────────
|
||||
|
||||
TASK_SYNCED = {
|
||||
"userId": "user-abc",
|
||||
"source": "todoist",
|
||||
"count": 7,
|
||||
"syncedAt": "2026-04-25T10:00:00Z",
|
||||
}
|
||||
|
||||
TIP_FEEDBACK = {
|
||||
"userId": "user-abc",
|
||||
"tipId": "tip-xyz",
|
||||
"action": "done",
|
||||
"reward": 1.0,
|
||||
"dwellMs": 4200,
|
||||
"createdAt": "2026-04-25T10:00:00Z",
|
||||
}
|
||||
|
||||
|
||||
class TestHandle:
|
||||
@pytest.mark.asyncio
|
||||
async def test_task_synced_writes_meta_file(self):
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
state_dir = Path(tmp)
|
||||
await _handle("signals.task.synced", TASK_SYNCED, state_dir)
|
||||
meta_path = _sync_meta_path(state_dir, "user-abc")
|
||||
assert meta_path.exists()
|
||||
data = json.loads(meta_path.read_text())
|
||||
assert data["task_count"] == 7
|
||||
assert data["last_sync_ts"] == "2026-04-25T10:00:00Z"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_task_synced_bad_payload_raises(self):
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
with pytest.raises(ValidationError):
|
||||
await _handle("signals.task.synced", {"userId": "u1"}, Path(tmp))
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_tip_feedback_valid_does_not_raise(self):
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
# should log and return cleanly
|
||||
await _handle("signals.tip.feedback", TIP_FEEDBACK, Path(tmp))
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_tip_feedback_bad_action_raises(self):
|
||||
bad = {**TIP_FEEDBACK, "action": "unknown"}
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
with pytest.raises(ValidationError):
|
||||
await _handle("signals.tip.feedback", bad, Path(tmp))
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_unhandled_subject_is_ignored(self):
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
# should not raise for unknown subjects
|
||||
await _handle("signals.something.new", {"any": "data"}, Path(tmp))
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_make_handler_acks_on_success(self):
|
||||
from nats_consumer import _make_handler
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
handler = _make_handler("signals", Path(tmp))
|
||||
msg = AsyncMock()
|
||||
msg.subject = "signals.task.synced"
|
||||
msg.data = json.dumps(TASK_SYNCED).encode()
|
||||
await handler(msg)
|
||||
msg.ack.assert_awaited_once()
|
||||
msg.nak.assert_not_awaited()
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_make_handler_naks_on_validation_error(self):
|
||||
from nats_consumer import _make_handler
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
handler = _make_handler("signals", Path(tmp))
|
||||
msg = AsyncMock()
|
||||
msg.subject = "signals.task.synced"
|
||||
msg.data = json.dumps({"userId": "u1"}).encode() # missing fields
|
||||
await handler(msg)
|
||||
msg.nak.assert_awaited_once()
|
||||
msg.ack.assert_not_awaited()
|
||||
63
packages/shared-types/README.md
Normal file
@@ -0,0 +1,63 @@
|
||||
# @oo/shared-types
|
||||
|
||||
Canonical contracts for all inter-module communication. Two surfaces:
|
||||
|
||||
| Surface | Format | Location |
|---------|--------|----------|
| HTTP (sync) | OpenAPI / TypeScript interfaces | `src/http/` |
| Events (async) | Protocol Buffers + TS interfaces | `src/events/`, `events/` |
|
||||
|
||||
## HTTP types
|
||||
|
||||
Hand-written TypeScript interfaces that mirror the OpenAPI specs. Imported by
`services/api` and `apps/web`; `ml/serving` keeps hand-written Python mirrors.
|
||||
|
||||
| File | Types |
|------|-------|
| `src/http/tip.ts` | `TipCandidate`, `RecommendResponse`, `TipFeedback` |
| `src/http/auth.ts` | `SessionUser` |
| `src/http/integrations.ts` | `IntegrationsResponse`, `Integration` |
| `src/http/user.ts` | `UserProfile` |
| `src/http/signal.ts` | `Signal`, `SignalSource` |
|
||||
|
||||
## Event types
|
||||
|
||||
Protobuf schemas live in `events/oo/events/v1/`. TypeScript interfaces in
`src/events/index.ts` mirror the proto envelope and payload types.
|
||||
|
||||
| Proto file | Messages |
|------------|----------|
| `envelope.proto` | `Envelope` (wraps every event) |
| `signals.proto` | `TaskSyncedPayload`, `TipServedPayload`, `TipFeedbackPayload`, `TipRewardFailedPayload` |
| `integration.proto` | `IntegrationTokenExpiredPayload` |
|
||||
|
||||
**Schema evolution rules (ADR-0005):**
- Additive changes only within a version (new fields, new message types).
- Removed fields must be marked `reserved`; never reuse a field number.
- Breaking changes require a new package version (`oo.events.v2`) and a `schemaVersion` bump in the envelope.
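
The additive-only rule is what lets an existing v1 consumer keep running while producers add fields. A minimal sketch of that idea, using only the standard library: the `TaskSynced` dataclass and `parse_task_synced` helper are hypothetical and are not the Pydantic models in `ml/serving/schemas.py`; the field names follow the proto3 JSON (camelCase) form of `TaskSyncedPayload`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskSynced:
    """Hypothetical v1 consumer view of TaskSyncedPayload (proto3 JSON names)."""
    userId: str
    source: str
    count: int
    syncedAt: str


def parse_task_synced(raw: dict) -> TaskSynced:
    """Build a TaskSynced from a decoded JSON object, ignoring unknown keys."""
    known = {f.name for f in fields(TaskSynced)}
    # Keys added later within v1 are dropped here, so additive changes do not
    # break this consumer. A missing required key still raises TypeError.
    return TaskSynced(**{k: v for k, v in raw.items() if k in known})
```

Under this sketch a payload carrying an extra, newer field parses to the same value as the original payload, while a payload that drops a field fails loudly, which is the behaviour a breaking change should have.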
|
||||
|
||||
## Schema registry / CI gate
|
||||
|
||||
`buf` enforces lint and breaking-change detection on every PR that touches `events/`:
|
||||
|
||||
```bash
# Lint
buf lint events/

# Breaking-change check against main
buf breaking events/ --against '.git#branch=main,subdir=packages/shared-types/events'
```
|
||||
|
||||
Local shortcut: `./scripts/buf-check.sh`
|
||||
|
||||
CI: `.gitea/workflows/buf-check.yaml` (requires a Gitea Actions runner).
|
||||
|
||||
Install buf: `curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf`
|
||||
|
||||
## Contract
|
||||
|
||||
`/health` — not applicable (library package, no process).
|
||||
|
||||
**Extraction criteria:** always a shared library. Extract to a separate registry
service only when schema governance requires independent versioning and deployment
(e.g. external consumers, SLA divergence from the monorepo).
|
||||
7
packages/shared-types/events/buf.yaml
Normal file
@@ -0,0 +1,7 @@
|
||||
version: v1
|
||||
lint:
|
||||
use:
|
||||
- STANDARD
|
||||
breaking:
|
||||
use:
|
||||
- FILE
|
||||
25
packages/shared-types/events/oo/events/v1/envelope.proto
Normal file
@@ -0,0 +1,25 @@
|
||||
syntax = "proto3";
|
||||
package oo.events.v1;
|
||||
|
||||
import "oo/events/v1/signals.proto";
|
||||
import "oo/events/v1/integration.proto";
|
||||
|
||||
// Envelope wraps every event on the bus and on NATS JetStream.
|
||||
// Wire format: proto3 JSON (camelCase field names).
|
||||
// schema_version = "v1" — bump to "v2" only for breaking payload changes.
|
||||
message Envelope {
|
||||
string event_id = 1; // UUID assigned by bus on publish
|
||||
string occurred_at = 2; // ISO 8601
|
||||
string schema_version = 3; // "v1"
|
||||
string producer = 4; // e.g. "services/api"
|
||||
string subject = 5; // NATS-style subject: domain.entity.verb
|
||||
uint64 seq = 6; // monotonic sequence from the bus ring
|
||||
|
||||
oneof payload {
|
||||
TaskSyncedPayload task_synced = 10;
|
||||
TipServedPayload tip_served = 11;
|
||||
TipFeedbackPayload tip_feedback = 12;
|
||||
TipRewardFailedPayload tip_reward_failed = 13;
|
||||
IntegrationTokenExpiredPayload integration_token_expired = 14;
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,9 @@
|
||||
syntax = "proto3";
|
||||
package oo.events.v1;
|
||||
|
||||
// subject: signals.integration.token_expired
|
||||
message IntegrationTokenExpiredPayload {
|
||||
string user_id = 1;
|
||||
string provider = 2;
|
||||
string detected_at = 3; // ISO 8601
|
||||
}
|
||||
39
packages/shared-types/events/oo/events/v1/signals.proto
Normal file
@@ -0,0 +1,39 @@
|
||||
syntax = "proto3";
|
||||
package oo.events.v1;
|
||||
|
||||
// subject: signals.task.synced
|
||||
message TaskSyncedPayload {
|
||||
string user_id = 1;
|
||||
string source = 2; // e.g. "todoist"
|
||||
int32 count = 3;
|
||||
string synced_at = 4; // ISO 8601
|
||||
}
|
||||
|
||||
// subject: signals.tip.served
|
||||
message TipServedPayload {
|
||||
string user_id = 1;
|
||||
string tip_id = 2;
|
||||
string policy = 3;
|
||||
string served_at = 4; // ISO 8601
|
||||
}
|
||||
|
||||
// subject: signals.tip.feedback
|
||||
// action: done | dismiss | snooze | helpful | not_helpful
|
||||
message TipFeedbackPayload {
|
||||
string user_id = 1;
|
||||
string tip_id = 2;
|
||||
string action = 3;
|
||||
double reward = 4;
|
||||
optional int64 dwell_ms = 5; // null when no dwell was recorded
|
||||
string created_at = 6; // ISO 8601
|
||||
}
|
||||
|
||||
// subject: signals.tip.reward_failed
|
||||
message TipRewardFailedPayload {
|
||||
string user_id = 1;
|
||||
string tip_id = 2;
|
||||
double reward = 3;
|
||||
int32 attempts = 4;
|
||||
string error = 5;
|
||||
string failed_at = 6; // ISO 8601
|
||||
}
|
||||
@@ -15,7 +15,9 @@
|
||||
"test": "vitest run",
|
||||
"test:watch": "vitest",
|
||||
"type-check": "tsc --noEmit",
|
||||
"clean": "rm -rf dist"
|
||||
"clean": "rm -rf dist",
|
||||
"buf:lint": "buf lint events",
|
||||
"buf:breaking": "buf breaking events --against '.git#branch=main,subdir=packages/shared-types/events'"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@vitest/coverage-v8": "^4.1.4",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
/**
|
||||
* NormalizedEvent — the durable envelope for all events flowing through
|
||||
* the system. Today: in-process EventEmitter. Tomorrow: NATS JetStream.
|
||||
* the system. Mirrors oo.events.v1.Envelope in packages/shared-types/events/.
|
||||
*
|
||||
* Subject taxonomy:
|
||||
* signals.task.synced — Todoist (or other source) task list refreshed
|
||||
@@ -10,10 +10,16 @@
|
||||
* signals.integration.token_expired — OAuth token needs reconnect
|
||||
*/
|
||||
export interface NormalizedEvent<T = unknown> {
|
||||
/** UUID assigned by bus on publish */
|
||||
eventId: string;
|
||||
/** NATS-style subject: domain.entity.verb */
|
||||
subject: string;
|
||||
/** ISO 8601 timestamp */
|
||||
ts: string;
|
||||
occurredAt: string;
|
||||
/** "v1" — bump for breaking payload changes; see packages/shared-types/events/ */
|
||||
schemaVersion: 'v1';
|
||||
/** e.g. "services/api" */
|
||||
producer: string;
|
||||
/** Monotonically increasing sequence number (in-process ring; JetStream seq in prod) */
|
||||
seq: number;
|
||||
payload: T;
|
||||
|
||||
@@ -4,5 +4,6 @@
|
||||
"outDir": "dist",
|
||||
"rootDir": "src"
|
||||
},
|
||||
"include": ["src"]
|
||||
"include": ["src"],
|
||||
"exclude": ["src/__tests__", "**/*.test.ts"]
|
||||
}
|
||||
|
||||
877
pnpm-lock.yaml
generated
File diff suppressed because it is too large
Load Diff
24
scripts/buf-check.sh
Executable file
@@ -0,0 +1,24 @@
|
||||
#!/usr/bin/env bash
|
||||
# Run buf lint and breaking-change detection locally.
|
||||
# Usage: ./scripts/buf-check.sh [against-branch]
|
||||
# Default against-branch: main
|
||||
set -euo pipefail
|
||||
|
||||
AGAINST="${1:-main}"
|
||||
ROOT="$(cd "$(dirname "$0")/.." && pwd)"
|
||||
EVENTS="$ROOT/packages/shared-types/events"
|
||||
|
||||
if ! command -v buf &>/dev/null; then
|
||||
echo "buf not found. Install: https://buf.build/docs/installation"
|
||||
echo " curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "==> buf lint"
|
||||
buf lint "$EVENTS"
|
||||
|
||||
echo "==> buf breaking against $AGAINST"
|
||||
buf breaking "$EVENTS" \
|
||||
--against ".git#branch=${AGAINST},subdir=packages/shared-types/events"
|
||||
|
||||
echo "All checks passed."
|
||||
91
services/api/README.md
Normal file
@@ -0,0 +1,91 @@
|
||||
# services/api
|
||||
|
||||
Express BFF that serves all client-facing routes, manages sessions, runs background signal sync, and proxies admin calls to `ml/serving`.
|
||||
|
||||
## Contract
|
||||
|
||||
```
GET    /health                              { ok: true }

POST   /api/auth/login                      → redirect to Google OAuth
GET    /api/auth/callback                   OAuth return URL
POST   /api/auth/logout
GET    /api/auth/session                    → { user? }
POST   /api/auth/token                      { token } → set sid cookie (ADMIN_TOKEN auth)

GET    /api/integrations                    list connected integrations
POST   /api/integrations/todoist/connect    start Todoist OAuth
GET    /api/integrations/todoist/callback
DELETE /api/integrations/:provider          disconnect

POST   /api/recommend                       → { tip }
POST   /api/tip/:id/feedback                { action } → { ok }

GET    /api/user/profile
DELETE /api/user                            account deletion

POST   /api/push/subscribe
DELETE /api/push/subscribe

GET    /api/admin/stats                     DAU/WAU, feedback breakdown
GET    /api/admin/users
GET    /api/admin/events                    recent event stream (ring buffer)
GET    /api/admin/sim/runs                  offline sim run list
POST   /api/admin/sim/run                   launch offline sim
GET    /api/admin/sim/runs/:id/output       tail sim stdout
...

GET    /api/ml/*                            admin-only proxy to ml/serving
```
|
||||
|
||||
## Middleware stack (request order)
|
||||
|
||||
1. `cors`: origin limited to `WEB_BASE_URL`
2. `tracingMiddleware`: reads or generates W3C `traceparent`; sets `req.traceId` + `req.traceparent`
3. `pinoHttp`: structured JSON request/response logs with `traceId` field; `/health` suppressed
4. `express.json()` / `cookieParser`
5. `sessionMiddleware`: validates `sid` cookie, attaches `req.userId`
|
||||
|
||||
## Observability
|
||||
|
||||
Logs are structured JSON via **pino**. Every line includes `traceId` (extracted from the incoming W3C `traceparent` header, or generated fresh). The same `traceparent` is forwarded on all outbound HTTP calls to `ml/serving` so traces correlate end-to-end.
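
The read-or-generate step for `traceparent` can be sketched as follows. This is an illustration in Python rather than the service's actual `tracingMiddleware`; the `resolve_traceparent` name is hypothetical, only version `00` headers are accepted, and marking freshly generated traces as sampled (`01`) is an assumption.

```python
import re
import secrets
from typing import Optional, Tuple

# W3C Trace Context: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def resolve_traceparent(header: Optional[str]) -> Tuple[str, str]:
    """Return (trace_id, traceparent), reusing a valid header or minting a new one."""
    match = _TRACEPARENT_RE.fullmatch(header or "")
    # An all-zero trace-id or parent-id is invalid per the spec.
    if match and set(match.group(1)) != {"0"} and set(match.group(2)) != {"0"}:
        return match.group(1), match.group(0)
    trace_id = secrets.token_hex(16)   # 32 hex characters
    parent_id = secrets.token_hex(8)   # 16 hex characters
    return trace_id, f"00-{trace_id}-{parent_id}-01"
```

The returned `traceparent` is what would be attached to outbound requests, so a downstream service that extracts the second field sees the same trace id in its own logs.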
|
||||
|
||||
Sentry error capture is active when `SENTRY_DSN` is set.
|
||||
|
||||
## Background tasks
|
||||
|
||||
- **Todoist sync scheduler**: runs every `TODOIST_SYNC_INTERVAL_MS` (default 15 min); starts 10 s after boot to avoid startup surge.
- **Retention purge**: deletes `tipScores` and `tipFeedback` rows older than 30 days; runs on boot and daily.
- **Profile TTL invalidation**: listens to `signals.task.synced` and `signals.tip.feedback` on the in-process Bus; invalidates cached user-level profile features so the next `/recommend` gets fresh values.
|
||||
|
||||
## Config
|
||||
|
||||
| Env var | Default | Description |
|---------|---------|-------------|
| `PORT` | `3001` | Listen port |
| `NODE_ENV` | `development` | Environment label |
| `DATABASE_PATH` | `./data/oo.db` | SQLite file |
| `SESSION_SECRET` | required | Cookie signing secret |
| `GOOGLE_CLIENT_ID/SECRET` | required | OAuth |
| `TODOIST_CLIENT_ID/SECRET` | required | OAuth |
| `API_BASE_URL` | `http://localhost:3001` | Self-referential redirect URI |
| `WEB_BASE_URL` | `http://localhost:3000` | CORS + post-login redirect |
| `ML_SERVING_URL` | `http://localhost:8000` | ml/serving base URL |
| `NATS_URL` | `` | NATS broker; empty = in-process bus only |
| `TODOIST_SYNC_INTERVAL_MS` | `900000` | Background sync cadence |
| `TIP_PROMPT_VERSION` | `` | Prompt variant(s) for `/generate` |
| `LOG_LEVEL` | `info` | pino log level |
| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
| `VAPID_*` | | Web push keys |
| `ADMIN_TOKEN` | `` | Static token for service/Playwright admin auth; empty = disabled |
|
||||
|
||||
## Health story
|
||||
|
||||
`GET /health` returns `{ ok: true }`. No dependency checks — upstream deps (`ml/serving`, NATS) have their own health endpoints checked separately.
|
||||
|
||||
## Extraction criteria
|
||||
|
||||
Extract to its own host when:
- Auth session management needs a dedicated Redis/PG session store, **or**
- Background sync load (Todoist, future connectors) displaces API serving on the shared host, **or**
- Team boundary emerges between auth/BFF and recommender orchestration.
|
||||
@@ -16,6 +16,7 @@
|
||||
},
|
||||
"dependencies": {
|
||||
"@oo/shared-types": "workspace:*",
|
||||
"@sentry/node": "^10.50.0",
|
||||
"better-sqlite3": "^11.8.1",
|
||||
"cookie-parser": "^1.4.7",
|
||||
"cors": "^2.8.5",
|
||||
@@ -27,6 +28,8 @@
|
||||
"nats": "^2.29.3",
|
||||
"node-fetch": "^3.3.2",
|
||||
"openid-client": "^6.3.4",
|
||||
"pino": "^10.3.1",
|
||||
"pino-http": "^11.0.0",
|
||||
"web-push": "^3.6.7",
|
||||
"zod": "^3.24.1"
|
||||
},
|
||||
|
||||
@@ -34,6 +34,17 @@ export const config = {
|
||||
ML_SERVING_URL: optional('ML_SERVING_URL', 'http://localhost:8000'),
|
||||
LITELLM_URL: optional('LITELLM_URL', 'http://localhost:4000'),
|
||||
|
||||
MLFLOW_URL: optional('MLFLOW_URL', 'http://localhost:5000'),
|
||||
AIRFLOW_URL: optional('AIRFLOW_URL', 'http://localhost:8080'),
|
||||
AIRFLOW_API_USER: optional('AIRFLOW_API_USER', 'admin'),
|
||||
AIRFLOW_API_PASSWORD: optional('AIRFLOW_API_PASSWORD', 'admin'),
|
||||
|
||||
/** Shared secret for internal Airflow→API callbacks. */
|
||||
INTERNAL_API_TOKEN: optional('INTERNAL_API_TOKEN', ''),
|
||||
|
||||
/** Static token for automated/service access to the admin panel (e.g. Playwright tests). */
|
||||
ADMIN_TOKEN: optional('ADMIN_TOKEN', ''),
|
||||
|
||||
VAPID_PUBLIC_KEY: optional('VAPID_PUBLIC_KEY', ''),
|
||||
VAPID_PRIVATE_KEY: optional('VAPID_PRIVATE_KEY', ''),
|
||||
VAPID_SUBJECT: optional('VAPID_SUBJECT', 'mailto:admin@localhost'),
|
||||
|
||||
@@ -156,6 +156,10 @@ export function runMigrations() {
|
||||
`ALTER TABLE tip_scores ADD COLUMN prompt_version TEXT`,
|
||||
`ALTER TABLE tip_scores ADD COLUMN llm_model TEXT`,
|
||||
`ALTER TABLE tip_scores ADD COLUMN tip_kind TEXT`,
|
||||
`ALTER TABLE sim_runs ADD COLUMN airflow_dag_run_id TEXT`,
|
||||
`ALTER TABLE sim_runs ADD COLUMN mlflow_run_id TEXT`,
|
||||
`ALTER TABLE sim_runs ADD COLUMN judge_mode TEXT NOT NULL DEFAULT 'rule'`,
|
||||
`ALTER TABLE sim_runs ADD COLUMN n_policies INTEGER NOT NULL DEFAULT 2`,
|
||||
]) {
|
||||
try { sqlite.exec(stmt); } catch { /* column already exists */ }
|
||||
}
|
||||
|
||||
@@ -112,9 +112,13 @@ export const simRuns = sqliteTable('sim_runs', {
|
||||
tasksPerRound: integer('tasks_per_round').notNull().default(8),
|
||||
useLlm: integer('use_llm', { mode: 'boolean' }).notNull().default(false),
|
||||
status: text('status').notNull().default('pending'), // 'pending'|'running'|'done'|'failed'
|
||||
judgeMode: text('judge_mode').notNull().default('rule'),
|
||||
nPolicies: integer('n_policies').notNull().default(2),
|
||||
summaryJson: text('summary_json'), // JSON: { [policy]: PolicySummary }
|
||||
winner: text('winner'),
|
||||
personaBreakdownJson: text('persona_breakdown_json'), // JSON: { [persona]: { [policy]: {reward,n} } }
|
||||
airflowDagRunId: text('airflow_dag_run_id'),
|
||||
mlflowRunId: text('mlflow_run_id'),
|
||||
createdAt: text('created_at').notNull(),
|
||||
finishedAt: text('finished_at'),
|
||||
});
|
||||
|
||||
@@ -56,7 +56,7 @@ describe('EventBus — delivery', () => {
|
||||
it('does not throw when publishing with no subscribers', () => {
|
||||
const b = makeBus();
|
||||
expect(() =>
|
||||
b.publish('signals.task.synced', { userId: 'u', count: 3, syncedAt: '' }),
|
||||
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 3, syncedAt: '' }),
|
||||
).not.toThrow();
|
||||
});
|
||||
|
||||
@@ -101,7 +101,7 @@ describe('EventBus — ring buffer / tail()', () => {
|
||||
it('tail() filters by subject prefix', () => {
|
||||
const b = makeBus();
|
||||
b.publish('signals.tip.served', { userId: 'u', tipId: 't', policy: 'p', servedAt: '' });
|
||||
b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' });
|
||||
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
|
||||
|
||||
const tipEvents = b.tail({ subject: 'signals.tip' });
|
||||
expect(tipEvents.every((e) => e.subject.startsWith('signals.tip'))).toBe(true);
|
||||
@@ -178,7 +178,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
|
||||
const hook = vi.fn();
|
||||
b.onPublish(hook);
|
||||
|
||||
const payload = { userId: 'u', count: 2, syncedAt: 'now' };
|
||||
const payload = { userId: 'u', source: 'todoist', count: 2, syncedAt: 'now' };
|
||||
b.publish('signals.task.synced', payload);
|
||||
|
||||
expect(hook).toHaveBeenCalledOnce();
|
||||
@@ -191,7 +191,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
|
||||
b.onPublish(() => calls.push('a'));
|
||||
b.onPublish(() => calls.push('b'));
|
||||
|
||||
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' });
|
||||
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
|
||||
expect(calls).toEqual(['a', 'b']);
|
||||
});
|
||||
|
||||
@@ -202,7 +202,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
|
||||
b.onPublish(hook);
|
||||
b.subscribe('signals.task.synced', sub);
|
||||
|
||||
b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' });
|
||||
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
|
||||
expect(hook).toHaveBeenCalledOnce();
|
||||
expect(sub).toHaveBeenCalledOnce();
|
||||
});
|
||||
@@ -215,7 +215,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
|
||||
throw new Error('boom');
|
||||
});
|
||||
expect(() =>
|
||||
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }),
|
||||
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
|
||||
).toThrow('boom');
|
||||
});
|
||||
});
|
||||
|
||||
@@ -106,7 +106,7 @@ describe('connectNats — bridge bus → JetStream', () => {
|
||||
|
||||
await connectNats('nats://test:4222');
|
||||
|
||||
const payload = { userId: 'u1', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
|
||||
const payload = { userId: 'u1', source: 'todoist', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
|
||||
bus.publish('signals.task.synced', payload);
|
||||
|
||||
// Allow the queued microtask in the hook to flush.
|
||||
@@ -121,16 +121,17 @@ describe('connectNats — bridge bus → JetStream', () => {
|
||||
|
||||
it('swallows JetStream publish errors so the in-process bus keeps working', async () => {
|
||||
const { connectNats } = await import('../nats.js');
|
||||
const { logger } = await import('../../logger.js');
|
||||
const { bus } = await import('../bus.js');
|
||||
|
||||
await connectNats('nats://test:4222');
|
||||
|
||||
// Force the next js.publish to reject.
|
||||
lastJsPublish.mockRejectedValueOnce(new Error('jetstream down'));
|
||||
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
|
||||
const errSpy = vi.spyOn(logger, 'error');
|
||||
|
||||
expect(() =>
|
||||
bus.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }),
|
||||
bus.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
|
||||
).not.toThrow();
|
||||
|
||||
// Wait a tick for the rejected promise's catch to run.
|
||||
@@ -142,12 +143,16 @@ describe('connectNats — bridge bus → JetStream', () => {
|
||||
describe('connectNats — failure mode', () => {
|
||||
it('logs a warning and stays silent when connect rejects', async () => {
|
||||
const { connectNats } = await import('../nats.js');
|
||||
const { logger } = await import('../../logger.js');
|
||||
|
||||
lastConnect.mockRejectedValueOnce(new Error('ECONNREFUSED'));
|
||||
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});
|
||||
const warnSpy = vi.spyOn(logger, 'warn');
|
||||
|
||||
await expect(connectNats('nats://nope:4222')).resolves.toBeUndefined();
|
||||
expect(warnSpy).toHaveBeenCalledWith(expect.stringContaining('connection failed'));
|
||||
expect(warnSpy).toHaveBeenCalledWith(
|
||||
expect.objectContaining({ err: expect.anything() }),
|
||||
expect.stringContaining('connection failed'),
|
||||
);
|
||||
});
|
||||
});
|
||||
|
||||
@@ -156,7 +161,7 @@ describe('Bus.onPublish contract — used by NATS bridge', () => {
|
||||
const b = new Bus();
|
||||
const hook = vi.fn();
|
||||
b.onPublish(hook);
|
||||
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' });
|
||||
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
|
||||
expect(hook).toHaveBeenCalledOnce();
|
||||
});
|
||||
});
|
||||
|
||||
@@ -45,6 +45,7 @@ export type RewardDeliveryFailedEvent = {
|
||||
|
||||
export type TaskSyncedEvent = {
|
||||
userId: string;
|
||||
source: string; // e.g. 'todoist'
|
||||
count: number;
|
||||
syncedAt: string;
|
||||
};
|
||||
|
||||
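Review note: the test hunks above add `source: 'todoist'` to every `signals.task.synced` payload to match the new `TaskSyncedEvent` field. A minimal sketch of how a subject-to-payload map could let the compiler flag a call site that omits the field is shown below. `TinyBus`, `SubjectMap` and `Handler` are hypothetical names for illustration only; the real `Bus` typing is not visible in this diff, and the `TipServedEvent` shape is inferred from the test payloads.

```typescript
// TaskSyncedEvent mirrors the diff; TipServedEvent is an assumption based on the tests.
type TaskSyncedEvent = { userId: string; source: string; count: number; syncedAt: string };
type TipServedEvent = { userId: string; tipId: string; policy: string; servedAt: string };

// Hypothetical map from subject name to payload type.
type SubjectMap = {
  'signals.task.synced': TaskSyncedEvent;
  'signals.tip.served': TipServedEvent;
};

type Handler = (payload: unknown) => void;

// Illustrative stand-in for the real bus, reduced to subscribe/publish.
class TinyBus {
  private handlers = new Map<string, Handler[]>();

  subscribe<S extends keyof SubjectMap>(subject: S, fn: (payload: SubjectMap[S]) => void): void {
    const list = this.handlers.get(subject) ?? [];
    list.push(fn as Handler);
    this.handlers.set(subject, list);
  }

  publish<S extends keyof SubjectMap>(subject: S, payload: SubjectMap[S]): void {
    for (const fn of this.handlers.get(subject) ?? []) fn(payload);
  }
}

const seen: string[] = [];
const tinyBus = new TinyBus();
tinyBus.subscribe('signals.task.synced', (e) => seen.push(`${e.source}:${e.count}`));
// Dropping `source` from this object would be a compile-time error under this typing.
tinyBus.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 3, syncedAt: '' });
```

With a map like this, adding a required field to an event type surfaces every stale `publish` call at compile time rather than in test failures.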
@@ -12,6 +12,7 @@
|
||||
|
||||
import type { NatsConnection, JetStreamClient, StreamConfig } from 'nats';
|
||||
import { bus } from './bus.js';
|
||||
import { logger } from '../logger.js';
|
||||
|
||||
let nc: NatsConnection | null = null;
|
||||
let js: JetStreamClient | null = null;
|
||||
@@ -67,13 +68,13 @@ export async function connectNats(natsUrl: string): Promise<void> {
|
||||
if (!js) return;
|
||||
const data = new TextEncoder().encode(JSON.stringify(payload));
|
||||
js.publish(subject, data).catch((err: Error) =>
|
||||
console.error(`[nats] publish failed for ${subject}: ${err.message}`),
|
||||
logger.error({ err, subject }, 'nats publish failed'),
|
||||
);
|
||||
});
|
||||
|
||||
console.log(`[nats] connected to ${natsUrl}, streams: ${STREAMS.map((s) => s.name).join(', ')}`);
|
||||
logger.info({ url: natsUrl, streams: STREAMS.map((s) => s.name) }, 'nats connected');
|
||||
} catch (err: any) {
|
||||
console.warn(`[nats] connection failed — running without JetStream: ${err.message}`);
|
||||
logger.warn({ err }, 'nats connection failed — running without JetStream');
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -1,7 +1,10 @@
|
||||
import 'dotenv/config';
|
||||
import { logger } from './logger.js';
|
||||
import express from 'express';
|
||||
import { pinoHttp } from 'pino-http';
|
||||
import cookieParser from 'cookie-parser';
|
||||
import cors from 'cors';
|
||||
import { tracingMiddleware } from './middleware/tracing.js';
|
||||
import { config } from './config.js';
|
||||
import { db, runMigrations } from './db/index.js';
|
||||
import { tipScores, tipFeedback } from './db/schema.js';
|
||||
@@ -12,7 +15,7 @@ import { integrationsRouter } from './routes/integrations.js';
|
||||
import { recommenderRouter } from './routes/recommender.js';
|
||||
import { userRouter } from './routes/user.js';
|
||||
import { pushRouter } from './routes/push.js';
|
||||
import { adminRouter } from './routes/admin.js';
|
||||
import { adminRouter, adminInternalRouter } from './routes/admin.js';
|
||||
import { mkdir } from 'fs/promises';
|
||||
import { dirname } from 'path';
|
||||
import { requireAuth } from './middleware/session.js';
|
||||
@@ -26,13 +29,11 @@ import { registerProfileSubscriptions } from './profile/subscriber.js';
|
||||
await mkdir(dirname(config.DATABASE_PATH), { recursive: true });
|
||||
runMigrations();
|
||||
|
||||
// Keep the API alive on stray async faults (e.g. a single bad admin route)
|
||||
// rather than dropping the whole process.
|
||||
process.on('unhandledRejection', (reason) => {
|
||||
console.error('[api] unhandledRejection', reason);
|
||||
logger.error({ err: reason }, 'unhandledRejection');
|
||||
});
|
||||
process.on('uncaughtException', (err) => {
|
||||
console.error('[api] uncaughtException', err);
|
||||
logger.fatal({ err }, 'uncaughtException');
|
||||
});
|
||||
|
||||
const app = express();
|
||||
@@ -43,6 +44,15 @@ app.use(
|
||||
credentials: true,
|
||||
}),
|
||||
);
|
||||
app.use(tracingMiddleware);
|
||||
app.use(
|
||||
pinoHttp({
|
||||
logger,
|
||||
genReqId: (req) => req.traceId,
|
||||
customProps: (req) => ({ traceId: req.traceId }),
|
||||
autoLogging: { ignore: (req) => req.url === '/health' },
|
||||
}),
|
||||
);
|
||||
app.use(express.json());
|
||||
app.use(cookieParser());
|
||||
app.use(sessionMiddleware);
|
||||
@@ -55,17 +65,15 @@ app.use('/api', recommenderRouter);
|
||||
app.use('/api/user', userRouter);
|
||||
app.use('/api/push', pushRouter);
|
||||
app.use('/api/admin', adminRouter);
|
||||
app.use('/api/admin', adminInternalRouter);
|
||||
|
||||
// Proxy ml/serving endpoints through the API (admin-only).
|
||||
// Allows admin UI to call /api/ml/stats/:userId, /api/ml/features/:userId
|
||||
// without needing direct access to the ml/serving port.
|
||||
app.use('/api/ml', requireAuth as any, requireAdmin as any, async (req: Request, res: Response) => {
|
||||
const mlUrl = config.ML_SERVING_URL;
|
||||
const target = `${mlUrl}${req.path}`;
|
||||
try {
|
||||
const upstream = await fetch(target, {
|
||||
method: req.method,
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
headers: { 'Content-Type': 'application/json', traceparent: req.traceparent },
|
||||
body: req.method !== 'GET' ? JSON.stringify(req.body) : undefined,
|
||||
signal: AbortSignal.timeout(5000),
|
||||
});
|
||||
@@ -82,7 +90,7 @@ async function purgeExpiredData() {
|
||||
await db.delete(tipScores).where(lt(tipScores.servedAt, cutoff));
|
||||
await db.delete(tipFeedback).where(lt(tipFeedback.createdAt, cutoff));
|
||||
} catch (err: any) {
|
||||
console.error(`[purge] retention cleanup failed: ${err.message}`);
|
||||
logger.error({ err }, 'retention cleanup failed');
|
||||
}
|
||||
}
|
||||
|
||||
@@ -90,7 +98,7 @@ purgeExpiredData();
|
||||
setInterval(purgeExpiredData, 24 * 60 * 60 * 1000);
|
||||
|
||||
app.listen(config.PORT, () => {
|
||||
console.log(`oO API listening on http://localhost:${config.PORT}`);
|
||||
logger.info({ port: config.PORT }, 'oO API listening');
|
||||
});
|
||||
|
||||
if (config.NATS_URL) {
|
||||
|
||||
12
services/api/src/logger.ts
Normal file
@@ -0,0 +1,12 @@
import pino from 'pino';
import * as Sentry from '@sentry/node';

if (process.env['SENTRY_DSN']) {
  Sentry.init({
    dsn: process.env['SENTRY_DSN'],
    environment: process.env['NODE_ENV'] ?? 'development',
  });
}

export const logger = pino({ level: process.env['LOG_LEVEL'] ?? 'info' });
export { Sentry };
26
services/api/src/middleware/tracing.ts
Normal file
@@ -0,0 +1,26 @@
import { randomBytes } from 'crypto';
import type { Request, Response, NextFunction } from 'express';

declare global {
  namespace Express {
    interface Request {
      traceId: string;
      traceparent: string;
    }
  }
}

export function tracingMiddleware(req: Request, _res: Response, next: NextFunction): void {
  const incoming = req.headers['traceparent'] as string | undefined;
  let traceId: string;
  if (incoming) {
    const parts = incoming.split('-');
    traceId = parts.length === 4 && parts[1]?.length === 32 ? parts[1] : randomBytes(16).toString('hex');
  } else {
    traceId = randomBytes(16).toString('hex');
  }
  const parentId = randomBytes(8).toString('hex');
  req.traceId = traceId;
  req.traceparent = `00-${traceId}-${parentId}-01`;
  next();
}
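Review note: the tracing middleware above reuses an incoming `traceparent` whenever it has four dash-separated parts and a 32-character second field, so non-hex or all-zero values would pass through. A stricter check might look like the sketch below. `extractTraceId` and `TRACEPARENT_RE` are hypothetical names that are not part of this diff, and the rules encoded (lowercase hex fields, version `ff` rejected, all-zero IDs rejected) reflect my reading of the W3C Trace Context format for version `00`.

```typescript
// Hypothetical helper (not in the diff): validate a version-00 style traceparent
// and return its trace-id, or null when the header should be ignored.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function extractTraceId(header: string | undefined): string | null {
  if (!header) return null;
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const version = m[1] ?? '';
  const traceId = m[2] ?? '';
  const parentId = m[3] ?? '';
  if (version === 'ff') return null;       // version ff is reserved as invalid
  if (/^0+$/.test(traceId)) return null;   // an all-zero trace-id is invalid
  if (/^0+$/.test(parentId)) return null;  // an all-zero parent-id is invalid
  return traceId;
}
```

Under this assumption the middleware could use `extractTraceId(incoming) ?? randomBytes(16).toString('hex')` in place of the length-only check.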
@@ -4,7 +4,7 @@
|
||||
* A real Express app + in-memory SQLite DB per test suite.
|
||||
* Auth and admin middleware are mocked so we can focus on route logic.
|
||||
*/
|
||||
import { describe, it, expect, vi, beforeAll } from 'vitest';
|
||||
import { describe, it, expect, vi, beforeAll, afterEach } from 'vitest';
|
||||
import express from 'express';
|
||||
import * as http from 'http';
|
||||
import { makeTestDb } from '../../test/db.js';
|
||||
@@ -385,16 +385,126 @@ describe('GET /api/admin/events', () => {
|
||||
});
|
||||
});
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Health endpoint — mock fetch so tests don't depend on running services.
|
||||
// ---------------------------------------------------------------------------
|
||||
describe('GET /api/admin/health', () => {
|
||||
it('returns 200 with ok, services array, and checkedAt', async () => {
|
||||
const EXPECTED_HTTP_SERVICES = ['api', 'ml-serving', 'mlflow', 'airflow'] as const;
|
||||
const EXPECTED_INTERNAL = ['sqlite', 'event-bus'] as const;
|
||||
const VALID_STATUSES = new Set(['ok', 'degraded', 'down']);
|
||||
|
||||
type ServiceRow = { name: string; status: string; latencyMs: number };
|
||||
type HealthBody = { ok: boolean; services: ServiceRow[]; checkedAt: string };
|
||||
|
||||
function mockFetch(upServices: Set<string>) {
|
||||
// Resolve service name by port (matches defaults in config.ts).
|
||||
// Up services return HTTP 200; absent ones throw (simulates connection refused → 'down').
|
||||
vi.stubGlobal('fetch', async (url: string) => {
|
||||
const s = String(url);
|
||||
let name: string;
|
||||
if (s.includes(':8000')) name = 'ml-serving';
|
||||
else if (s.includes(':5000')) name = 'mlflow';
|
||||
else if (s.includes(':8080')) name = 'airflow';
|
||||
else name = 'api';
|
||||
|
||||
if (!upServices.has(name)) throw new Error(`ECONNREFUSED ${name}`);
|
||||
return { ok: true, json: async () => ({ ok: true, status: 'healthy' }) };
|
||||
});
|
||||
}
|
||||
|
||||
afterEach(() => vi.unstubAllGlobals());
|
||||
|
||||
it('shape: 200, typed fields, all expected services present', async () => {
|
||||
mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
|
||||
const { server, call } = await startServer(buildApp());
|
||||
try {
|
||||
const { status, body } = await call('GET', '/api/admin/health');
|
||||
const b = body as { ok: boolean; services: { name: string; status: string }[]; checkedAt: string };
|
||||
const b = body as HealthBody;
|
||||
expect(status).toBe(200);
|
||||
expect(typeof b.ok).toBe('boolean');
|
||||
expect(Array.isArray(b.services)).toBe(true);
|
||||
expect(typeof b.checkedAt).toBe('string');
|
||||
expect(new Date(b.checkedAt).getTime()).toBeGreaterThan(0);
|
||||
|
||||
const names = b.services.map((s) => s.name);
|
||||
for (const svc of [...EXPECTED_HTTP_SERVICES, ...EXPECTED_INTERNAL]) {
|
||||
expect(names).toContain(svc);
|
||||
}
|
||||
for (const svc of b.services) {
|
||||
expect(VALID_STATUSES).toContain(svc.status);
|
||||
expect(typeof svc.latencyMs).toBe('number');
|
||||
}
|
||||
} finally {
|
||||
server.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('ok=true when all HTTP services respond 200', async () => {
|
||||
mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
|
||||
const { server, call } = await startServer(buildApp());
|
||||
try {
|
||||
const { body } = await call('GET', '/api/admin/health');
|
||||
const b = body as HealthBody;
|
||||
for (const name of EXPECTED_HTTP_SERVICES) {
|
||||
const svc = b.services.find((s) => s.name === name);
|
||||
expect(svc?.status, `${name} should be ok`).toBe('ok');
|
||||
}
|
||||
expect(b.ok).toBe(true);
|
||||
} finally {
|
||||
server.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('ml-serving=down and ok=false when ml-serving is unreachable', async () => {
|
||||
mockFetch(new Set(['api', 'mlflow', 'airflow'])); // ml-serving absent
|
||||
const { server, call } = await startServer(buildApp());
|
||||
try {
|
||||
const { body } = await call('GET', '/api/admin/health');
|
||||
const b = body as HealthBody;
|
||||
const mlSvc = b.services.find((s) => s.name === 'ml-serving');
|
||||
expect(mlSvc?.status).toBe('down');
|
||||
expect(b.ok).toBe(false);
|
||||
} finally {
|
||||
server.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('airflow=down and ok=false when airflow is unreachable', async () => {
|
||||
mockFetch(new Set(['api', 'ml-serving', 'mlflow'])); // airflow absent
|
||||
const { server, call } = await startServer(buildApp());
|
||||
try {
|
||||
const { body } = await call('GET', '/api/admin/health');
|
||||
const b = body as HealthBody;
|
||||
const svc = b.services.find((s) => s.name === 'airflow');
|
||||
expect(svc?.status).toBe('down');
|
||||
expect(b.ok).toBe(false);
|
||||
} finally {
|
||||
server.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('mlflow=down and ok=false when mlflow is unreachable', async () => {
|
||||
mockFetch(new Set(['api', 'ml-serving', 'airflow'])); // mlflow absent
|
||||
const { server, call } = await startServer(buildApp());
|
||||
try {
|
||||
const { body } = await call('GET', '/api/admin/health');
|
||||
const b = body as HealthBody;
|
||||
const svc = b.services.find((s) => s.name === 'mlflow');
|
||||
expect(svc?.status).toBe('down');
|
||||
expect(b.ok).toBe(false);
|
||||
} finally {
|
||||
server.close();
|
||||
}
|
||||
});
|
||||
|
||||
it('sqlite and event-bus are always present regardless of HTTP service status', async () => {
|
||||
mockFetch(new Set()); // all HTTP services down
|
||||
const { server, call } = await startServer(buildApp());
|
||||
try {
|
||||
const { body } = await call('GET', '/api/admin/health');
|
||||
const b = body as HealthBody;
|
||||
expect(b.services.find((s) => s.name === 'sqlite')?.status).toBe('ok');
|
||||
expect(b.services.find((s) => s.name === 'event-bus')?.status).toBe('ok');
|
||||
} finally {
|
||||
server.close();
|
||||
}
|
||||
|
||||
@@ -1,4 +1,5 @@
|
||||
import { type Router as ExpressRouter, Router, Response } from 'express';
|
||||
import { type Router as ExpressRouter, Router, Response, type Request } from 'express';
|
||||
import { logger } from '../logger.js';
|
||||
import { db, rawSqlite } from '../db/index.js';
|
||||
import {
|
||||
users,
|
||||
@@ -523,16 +524,24 @@ router.get('/data-quality', async (req: AuthenticatedRequest, res: Response) =>
|
||||
// Fan-out to all subsystem /health endpoints.
|
||||
// ---------------------------------------------------------------------------
|
||||
router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
|
||||
const checks: Array<{ name: string; url: string }> = [
|
||||
{ name: 'api', url: `http://localhost:${process.env.PORT ?? 3001}/health` },
|
||||
const airflowAuth = Buffer.from(`${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`).toString('base64');
|
||||
|
||||
const checks: Array<{ name: string; url: string; headers?: Record<string, string> }> = [
|
||||
{ name: 'api', url: `http://localhost:${config.PORT}/health` },
|
||||
{ name: 'ml-serving', url: `${config.ML_SERVING_URL}/health` },
|
||||
{ name: 'mlflow', url: `${config.MLFLOW_URL}/health` },
|
||||
{ name: 'airflow', url: `${config.AIRFLOW_URL}/api/v1/health`,
|
||||
headers: { Authorization: `Basic ${airflowAuth}` } },
|
||||
];
|
||||
|
||||
const results = await Promise.allSettled(
|
||||
checks.map(async ({ name, url }) => {
|
||||
checks.map(async ({ name, url, headers }) => {
|
||||
const t0 = Date.now();
|
||||
try {
|
||||
const r = await fetch(url, { signal: AbortSignal.timeout(3000) });
|
||||
const r = await fetch(url, {
|
||||
headers,
|
||||
signal: AbortSignal.timeout(3000),
|
||||
});
|
||||
return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
|
||||
} catch {
|
||||
return { name, status: 'down', latencyMs: Date.now() - t0 };
|
||||
@@ -548,15 +557,12 @@ router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
|
||||
dbStatus = 'down';
|
||||
}
|
||||
|
||||
// Event bus: always ok if process is alive
|
||||
const eventBusStatus = 'ok';
|
||||
|
||||
const services = results.map((r) =>
|
||||
r.status === 'fulfilled' ? r.value : { name: 'unknown', status: 'down', latencyMs: 0 },
|
||||
);
|
||||
|
||||
services.push({ name: 'sqlite', status: dbStatus, latencyMs: 0 });
|
||||
services.push({ name: 'event-bus', status: eventBusStatus, latencyMs: 0 });
|
||||
services.push({ name: 'event-bus', status: 'ok', latencyMs: 0 });
|
||||
|
||||
const allOk = services.every((s) => s.status === 'ok');
|
||||
res.json({ ok: allOk, services, checkedAt: new Date().toISOString() });
|
||||
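Review note: in the health hunk above, each probe catches its own errors and returns a `down` row, so as far as this hunk shows the `Promise.allSettled` fallback row named `unknown` should never be produced. The sketch below isolates that fan-out pattern with an injected fetch function so it can be exercised without a network. `probeAll`, `FetchLike`, `Check` and `ProbeResult` are hypothetical names, and the 3-second timeout from the diff is left out to keep the example small.

```typescript
type Status = 'ok' | 'degraded' | 'down';
type Check = { name: string; url: string; headers?: Record<string, string> | undefined };
type ProbeResult = { name: string; status: Status; latencyMs: number };

// Minimal stand-in for the parts of fetch() this sketch relies on.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> | undefined },
) => Promise<{ ok: boolean }>;

// Probe every service concurrently. A probe never rejects: a non-2xx response
// maps to 'degraded' and a thrown error (refused connection, timeout) to 'down'.
async function probeAll(checks: Check[], fetchImpl: FetchLike): Promise<ProbeResult[]> {
  return Promise.all(
    checks.map(async ({ name, url, headers }): Promise<ProbeResult> => {
      const t0 = Date.now();
      try {
        const r = await fetchImpl(url, { headers });
        return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
      } catch {
        return { name, status: 'down', latencyMs: Date.now() - t0 };
      }
    }),
  );
}
```

Because the per-probe `try`/`catch` keeps the service name, every row in the response stays attributable to a service, which is what the `ml-serving=down`, `airflow=down` and `mlflow=down` tests rely on.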
@@ -699,22 +705,21 @@ router.delete('/saved-queries/:id', async (req: AuthenticatedRequest, res: Respo
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// POST /api/admin/simulate/start
|
||||
// Spawn ml/experiments/sim/runner.py in the background; return run_id.
|
||||
// Trigger an Airflow DAG run (bandit_sim). Falls back to a local subprocess
|
||||
// when AIRFLOW_URL is not reachable, so local dev still works.
|
||||
// ---------------------------------------------------------------------------
|
||||
router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response) => {
|
||||
const {
|
||||
nUsers = 5,
|
||||
nRounds = 20,
|
||||
tasksPerRound = 8,
|
||||
useLlm = false,
|
||||
judgeMode = 'rule',
|
||||
policies = ['linucb-v1', 'egreedy-v1'],
|
||||
} = req.body as {
|
||||
nUsers?: number;
|
||||
nRounds?: number;
|
||||
tasksPerRound?: number;
|
||||
useLlm?: boolean;
|
||||
judgeMode?: 'rule' | 'llm' | 'claude-code';
|
||||
judgeMode?: 'rule' | 'llm';
|
||||
policies?: string[];
|
||||
};
|
||||
|
||||
@@ -733,17 +738,69 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
|
||||
nUsers,
|
||||
nRounds,
|
||||
tasksPerRound,
|
||||
useLlm,
|
||||
useLlm: judgeMode === 'llm',
|
||||
judgeMode,
|
||||
nPolicies: policies.length,
|
||||
status: 'running',
|
||||
createdAt: now,
|
||||
});
|
||||
|
||||
// ── Try Airflow first ────────────────────────────────────────────────────
|
||||
if (config.AIRFLOW_URL && config.INTERNAL_API_TOKEN) {
|
||||
try {
|
||||
const airflowAuth = Buffer.from(
|
||||
`${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`,
|
||||
).toString('base64');
|
||||
|
||||
const dagRes = await fetch(
|
||||
`${config.AIRFLOW_URL}/api/v1/dags/bandit_sim/dagRuns`,
|
||||
{
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json',
|
||||
Authorization: `Basic ${airflowAuth}`,
|
||||
},
|
||||
body: JSON.stringify({
|
||||
conf: {
|
||||
sim_run_id: id,
|
||||
n_users: nUsers,
|
||||
n_rounds: nRounds,
|
||||
tasks_per_round: tasksPerRound,
|
||||
policies,
|
||||
judge_mode: judgeMode,
|
||||
ml_url: config.ML_SERVING_URL,
|
||||
mlflow_url: config.MLFLOW_URL,
|
||||
callback_url: `${config.API_BASE_URL}/api/admin/simulate/${id}/complete`,
|
||||
internal_token: config.INTERNAL_API_TOKEN,
|
||||
},
|
||||
}),
|
||||
signal: AbortSignal.timeout(5000),
|
||||
},
|
||||
);
|
||||
|
||||
if (dagRes.ok) {
|
||||
const dagBody = await dagRes.json() as { dag_run_id: string };
|
||||
await db
|
||||
.update(simRuns)
|
||||
.set({ airflowDagRunId: dagBody.dag_run_id })
|
||||
.where(eq(simRuns.id, id));
|
||||
|
||||
res.json({ id, status: 'running', airflow_dag_run_id: dagBody.dag_run_id });
|
||||
return;
|
||||
}
|
||||
logger.warn({ status: dagRes.status }, 'sim: Airflow trigger failed, falling back to subprocess');
|
||||
} catch (err) {
|
||||
logger.warn({ err }, 'sim: Airflow unreachable, falling back to subprocess');
|
||||
}
|
||||
}
|
||||
|
||||
// ── Subprocess fallback (local dev / Airflow not configured) ────────────
|
||||
const runnerPath = resolve(__dirname, '../../../../ml/experiments/sim/runner.py');
|
||||
const venvPython = resolve(__dirname, '../../../../ml/serving/.venv/bin/python');
|
||||
const pythonBin = existsSync(venvPython) ? venvPython : 'python3';
|
||||
const outPath = `/tmp/oo-sim-${id}.json`;
|
||||
|
||||
const args = [
|
||||
const child = spawn(pythonBin, [
|
||||
runnerPath,
|
||||
'--n-users', String(nUsers),
|
||||
'--n-rounds', String(nRounds),
|
||||
@@ -751,32 +808,22 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
|
||||
'--ml-url', config.ML_SERVING_URL,
|
||||
'--policies', ...policies,
|
||||
'--out', outPath,
|
||||
'--judge', judgeMode === 'llm' ? 'llm' : judgeMode === 'claude-code' ? 'rule' : 'rule',
|
||||
// claude-code mode isn't auto-runnable from the API (requires human in the loop)
|
||||
// it falls back to rule judge when triggered from the panel
|
||||
];
|
||||
'--judge', judgeMode,
|
||||
'--mlflow-url', config.MLFLOW_URL,
|
||||
'--mlflow-experiment', 'bandit_simulation',
|
||||
], { stdio: ['ignore', 'pipe', 'pipe'] });
|
||||
|
||||
const child = spawn(pythonBin, args, { stdio: ['ignore', 'pipe', 'pipe'] });
|
||||
if (child.pid) _simProcesses.set(id, { pid: child.pid, startedAt: now });
|
||||
|
||||
if (child.pid) {
|
||||
_simProcesses.set(id, { pid: child.pid, startedAt: now });
|
||||
}
|
||||
|
||||
// Without this listener, a spawn failure (ENOENT when python3 is absent
|
||||
// — e.g. in the alpine api container) would emit an unhandled 'error' event
|
||||
// and crash the whole API process.
|
||||
child.on('error', async (err) => {
|
||||
console.error('[sim] spawn error', err);
|
||||
logger.error({ err }, 'sim: spawn error');
|
||||
_simProcesses.delete(id);
|
||||
await db
|
||||
.update(simRuns)
|
||||
await db.update(simRuns)
|
||||
.set({ status: 'failed', finishedAt: new Date().toISOString() })
|
||||
.where(eq(simRuns.id, id));
|
||||
});
|
||||
|
||||
// Capture stderr for debugging
|
||||
const stderrLines: string[] = [];
|
||||
child.stderr?.on('data', (d: Buffer) => stderrLines.push(d.toString()));
|
||||
child.stderr?.on('data', (d: Buffer) => logger.debug({ stderr: d.toString() }, 'sim stderr'));
|
||||
|
||||
child.on('exit', async (code) => {
|
||||
_simProcesses.delete(id);
|
||||
@@ -785,8 +832,6 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
|
||||
if (code === 0 && existsSync(outPath)) {
|
||||
try {
|
||||
const raw = JSON.parse(readFileSync(outPath, 'utf-8'));
|
||||
|
||||
// Bulk-insert sim events
|
||||
const eventRows = (raw.events ?? []).map((ev: Record<string, unknown>) => ({
|
||||
id: nanoid(),
|
||||
runId: id,
|
||||
@@ -804,21 +849,19 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
|
||||
dayOfWeek: Number(ev.day_of_week),
|
||||
createdAt: now,
|
||||
}));
|
||||
|
||||
for (const row of eventRows) {
|
||||
await db.insert(simEvents).values(row).catch(() => {});
|
||||
}
|
||||
|
||||
await db.update(simRuns).set({
|
||||
status: 'done',
|
||||
summaryJson: JSON.stringify(raw.summary),
|
||||
winner: raw.winner,
|
||||
personaBreakdownJson: JSON.stringify(raw.persona_breakdown),
|
||||
mlflowRunId: raw.mlflow_run_id ?? null,
|
||||
finishedAt,
|
||||
}).where(eq(simRuns.id, id));
|
||||
|
||||
try { unlinkSync(outPath); } catch { /* ignore */ }
|
||||
} catch (e) {
|
||||
} catch {
|
||||
await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
|
||||
}
|
||||
} else {
|
||||
@@ -863,4 +906,68 @@ router.get('/simulate/:id', async (req: AuthenticatedRequest, res: Response) =>
|
||||
res.json({ run: { ...run, isRunning }, events });
|
||||
});
|
||||
|
||||
export { router as adminRouter };
|
||||
// ---------------------------------------------------------------------------
|
||||
// internalRouter — no session auth; only INTERNAL_API_TOKEN header check.
|
||||
// Mounted separately in index.ts at /api/admin to avoid router.use() auth.
|
||||
// ---------------------------------------------------------------------------
|
||||
const internalRouter: ExpressRouter = Router();
|
||||
|
||||
internalRouter.post('/simulate/:id/complete', async (req: Request, res: Response) => {
|
||||
const token = req.headers['x-internal-token'];
|
||||
if (!config.INTERNAL_API_TOKEN || token !== config.INTERNAL_API_TOKEN) {
|
||||
res.status(401).json({ error: 'Unauthorized' });
|
||||
return;
|
||||
}
|
||||
|
||||
const { id } = req.params as { id: string };
|
||||
const { summary, winner, persona_breakdown, events: rawEvents, mlflow_run_id } =
|
||||
req.body as {
|
||||
summary: Record<string, unknown>;
|
||||
winner: string;
|
||||
persona_breakdown: Record<string, unknown>;
|
||||
events: Record<string, unknown>[];
|
||||
mlflow_run_id?: string;
|
||||
};
|
||||
|
||||
const finishedAt = new Date().toISOString();
|
||||
const now = finishedAt;
|
||||
|
||||
try {
|
||||
const eventRows = (rawEvents ?? []).map((ev) => ({
|
||||
id: nanoid(),
|
||||
runId: id,
|
||||
round: Number(ev['round']),
|
||||
userId: String(ev['user_id']),
|
||||
persona: String(ev['persona']),
|
||||
policy: String(ev['policy']),
|
||||
tipContent: String(ev['tip_content']),
|
||||
priority: Number(ev['priority']),
|
||||
isOverdue: Boolean(ev['is_overdue']),
|
||||
action: String(ev['action']),
|
||||
dwellMs: ev['dwell_ms'] != null ? Number(ev['dwell_ms']) : null,
|
||||
rewardMilli: Math.round(Number(ev['reward']) * 1000),
|
||||
hour: Number(ev['hour']),
|
||||
dayOfWeek: Number(ev['day_of_week']),
|
||||
createdAt: now,
|
||||
}));
|
||||
for (const row of eventRows) {
|
||||
await db.insert(simEvents).values(row).catch(() => {});
|
||||
}
|
||||
await db.update(simRuns).set({
|
||||
status: 'done',
|
||||
summaryJson: JSON.stringify(summary),
|
||||
winner,
|
||||
personaBreakdownJson: JSON.stringify(persona_breakdown),
|
||||
mlflowRunId: mlflow_run_id ?? null,
|
||||
finishedAt,
|
||||
}).where(eq(simRuns.id, id));
|
||||
|
||||
res.json({ ok: true });
|
||||
} catch (err) {
|
||||
logger.error({ err }, 'sim: complete callback failed');
|
||||
await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
|
||||
res.status(500).json({ error: 'Failed to store results' });
|
||||
}
|
||||
});
|
||||
|
||||
export { router as adminRouter, internalRouter as adminInternalRouter };
|
||||
|
||||
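Review note: the internal callback above compares the `x-internal-token` header to `config.INTERNAL_API_TOKEN` with `!==`. If timing side channels are a concern for this endpoint, a constant-time comparison is one option. The sketch below uses Node's `crypto.timingSafeEqual`; `tokensMatch` is a hypothetical helper that is not part of this diff, and hashing both values first is only a convenience to get equal-length buffers.

```typescript
import { createHash, timingSafeEqual } from 'crypto';

// Hypothetical helper (not in the diff): constant-time comparison of a presented
// header value against the configured secret. Returns false when either is missing
// or when the header arrived as a repeated (array) value.
function tokensMatch(
  presented: string | string[] | undefined,
  expected: string | undefined,
): boolean {
  if (!expected || typeof presented !== 'string') return false;
  // timingSafeEqual throws on length mismatch, so compare fixed-length digests.
  const a = createHash('sha256').update(presented).digest();
  const b = createHash('sha256').update(expected).digest();
  return timingSafeEqual(a, b);
}
```

Under this assumption the guard in the handler would read `if (!tokensMatch(req.headers['x-internal-token'], config.INTERNAL_API_TOKEN))`, and the same helper could serve the `ADMIN_TOKEN` check in the auth route.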
@@ -5,6 +5,7 @@ import { db } from '../db/index.js';
|
||||
import { users, sessions } from '../db/schema.js';
|
||||
import { eq } from 'drizzle-orm';
|
||||
import { config } from '../config.js';
|
||||
import { logger } from '../logger.js';
|
||||
|
||||
const router: ExpressRouter = Router();
|
||||
|
||||
@@ -36,7 +37,7 @@ router.get('/login', async (req: Request, res: Response) => {
|
||||
setTimeout(() => pendingStates.delete(state), 10 * 60 * 1000);
|
||||
|
||||
const redirectUri = `${config.API_BASE_URL}/api/auth/callback`;
|
||||
console.log('[auth] redirect_uri sent to Google:', redirectUri);
|
||||
logger.info({ redirectUri }, 'auth: redirect_uri');
|
||||
const authUrl = client.buildAuthorizationUrl(cfg, {
|
||||
redirect_uri: redirectUri,
|
||||
scope: 'openid email profile',
|
||||
@@ -72,7 +73,7 @@ router.get('/callback', async (req: Request, res: Response) => {
|
||||
expectedState: state,
|
||||
});
|
||||
} catch (err) {
|
||||
console.error('OAuth callback error', err);
|
||||
logger.error({ err }, 'auth: OAuth callback error');
|
||||
res.status(400).json({ error: 'OAuth error' });
|
||||
return;
|
||||
}
|
||||
@@ -123,6 +124,45 @@ router.get('/callback', async (req: Request, res: Response) => {
|
||||
.redirect(`${config.WEB_BASE_URL}${pending.redirectTo}`);
|
||||
});
|
||||
|
||||
/**
|
||||
* POST /api/auth/token
|
||||
* Exchange the static ADMIN_TOKEN for a session cookie.
|
||||
* Finds the first admin user in the DB; rejects if ADMIN_TOKEN is not configured.
|
||||
*/
|
||||
router.post('/token', async (req: Request, res: Response) => {
|
||||
const { token } = req.body as { token?: string };
|
||||
if (!config.ADMIN_TOKEN || !token || token !== config.ADMIN_TOKEN) {
|
||||
res.status(401).json({ error: 'Invalid token' });
|
||||
return;
|
||||
}
|
||||
|
||||
const [adminUser] = await db
|
||||
.select()
|
||||
.from(users)
|
||||
.where(eq(users.role, 'admin'))
|
||||
.limit(1);
|
||||
|
||||
if (!adminUser) {
|
||||
res.status(403).json({ error: 'No admin user exists' });
|
||||
return;
|
||||
}
|
||||
|
||||
const sid = nanoid(32);
|
||||
const now = new Date().toISOString();
|
||||
const expiresAt = new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString();
|
||||
await db.insert(sessions).values({ id: sid, userId: adminUser.id, expiresAt, createdAt: now });
|
||||
|
||||
res
|
||||
.cookie('sid', sid, {
|
||||
httpOnly: true,
|
||||
secure: config.NODE_ENV === 'production',
|
||||
sameSite: 'lax',
|
||||
expires: new Date(expiresAt),
|
||||
path: '/',
|
||||
})
|
||||
.json({ ok: true });
|
||||
});
|
||||
|
||||
/** POST /api/auth/logout */
|
||||
router.post('/logout', async (req: Request, res: Response) => {
|
||||
const sid = req.cookies?.sid as string | undefined;
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
import { type Router as ExpressRouter, Router, Response } from 'express';
|
||||
import { nanoid } from 'nanoid';
|
||||
import { logger } from '../logger.js';
|
||||
import { db } from '../db/index.js';
|
||||
import { integrationTokens, tipFeedback, tipViews, tipScores } from '../db/schema.js';
|
||||
import { eq, and, desc } from 'drizzle-orm';
|
||||
@@ -47,7 +48,8 @@ export const _clearCandidateCacheForTests = () => {
|
||||
// Shadow-policy registry
|
||||
// ---------------------------------------------------------------------------
|
||||
const shadowPolicies = new Map<string, { active: boolean }>([
|
||||
// egreedy-v2 (D=12, profile features) — disabled until sim gate per ADR-0012
|
||||
// egreedy-v2 promoted to active policy (ADR-0012). Shadow entry kept for
|
||||
// rollback toggle; leave disabled in normal operation.
|
||||
['egreedy-v2-shadow', { active: false }],
|
||||
]);
|
||||
|
||||
@@ -84,6 +86,7 @@ async function remotePolicy(
|
||||
userId: string,
|
||||
tasks: TipCandidate[],
|
||||
profile: Profile,
|
||||
traceparent?: string,
|
||||
): Promise<{ tipId: string; score: number; policy: string } | null> {
|
||||
const hour = new Date().getHours();
|
||||
const dayOfWeek = new Date().getDay();
|
||||
@@ -101,17 +104,16 @@ async function remotePolicy(
|
||||
profile_features: profile,
|
||||
};
|
||||
|
||||
// Active policy: egreedy-v1 (selected over linucb-v1 after offline sim — ADR-0007)
|
||||
try {
|
||||
const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy`, {
|
||||
const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy/v2`, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
|
||||
body: JSON.stringify(body),
|
||||
signal: AbortSignal.timeout(3000),
|
||||
});
|
||||
if (!res.ok) return null;
|
||||
const data = (await res.json()) as { tip_id: string; score: number };
|
||||
return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v1' };
|
||||
return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v2' };
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
@@ -145,6 +147,7 @@ async function fetchLlmCandidates(
|
||||
dayOfWeek: number,
|
||||
promptVersion: string | null,
|
||||
profile: Profile,
|
||||
traceparent?: string,
|
||||
): Promise<LlmGenerateResult> {
|
||||
try {
|
||||
const tasks = signals.slice(0, 10).map((s) => ({
|
||||
@@ -155,7 +158,7 @@ async function fetchLlmCandidates(
|
||||
}));
|
||||
const res = await fetch(`${config.ML_SERVING_URL}/generate`, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
|
||||
body: JSON.stringify({
|
||||
user_id: userId,
|
||||
context: { tasks, hour_of_day: hour, day_of_week: dayOfWeek },
|
||||
@@ -225,6 +228,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
|
||||
dayOfWeek,
|
||||
requestedPromptVersion,
|
||||
profile,
|
||||
req.traceparent,
|
||||
);
|
||||
|
||||
const allCandidates: TipCandidate[] = [...signalCandidates, ...llmResult.candidates];
|
||||
@@ -239,7 +243,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
|
||||
const t0 = Date.now();
|
||||
|
||||
// Stage 2: score — egreedy bandit with random fallback
|
||||
const scored = await remotePolicy(req.userId!, allCandidates, profile);
|
||||
const scored = await remotePolicy(req.userId!, allCandidates, profile, req.traceparent);
|
||||
const latencyMs = Date.now() - t0;
|
||||
const tip = scored
|
||||
? (allCandidates.find((t) => t.id === scored.tipId) ?? randomPolicy(allCandidates))
|
||||
@@ -371,6 +375,8 @@ async function sendRewardWithRetry(
|
||||
tipId: string,
|
||||
reward: number,
|
||||
features: TipCandidate['features'],
|
||||
profile: Profile,
|
||||
traceparent?: string,
|
||||
): Promise<void> {
|
||||
const body = JSON.stringify({
|
||||
user_id: userId,
|
||||
@@ -378,13 +384,14 @@ async function sendRewardWithRetry(
|
||||
reward,
|
||||
features,
|
||||
day_of_week: new Date().getDay(),
|
||||
profile_features: profile,
|
||||
});
|
||||
|
||||
for (let attempt = 1; attempt <= 3; attempt++) {
|
||||
try {
|
||||
const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy`, {
|
||||
const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy/v2`, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
|
||||
body,
|
||||
signal: AbortSignal.timeout(3000),
|
||||
});
|
||||
@@ -392,7 +399,7 @@ async function sendRewardWithRetry(
|
||||
throw new Error(`HTTP ${res.status}`);
|
||||
} catch (err: any) {
|
||||
if (attempt === 3) {
|
||||
console.error(`[reward] failed after 3 attempts for tip ${tipId}: ${err.message}`);
|
||||
logger.error({ tipId, err }, 'reward: failed after 3 attempts');
|
||||
bus.publish('signals.tip.reward_failed', {
|
||||
userId,
|
||||
tipId,
|
||||
@@ -463,7 +470,9 @@ router.post('/tip/:id/feedback', requireAuth, async (req: AuthenticatedRequest,
|
||||
});
|
||||
|
||||
if (candidate) {
|
||||
sendRewardWithRetry(req.userId!, tipId, reward, candidate.features);
|
||||
// Re-fetch profile for the v2 ridge update; TTL cache makes this near-instant.
|
||||
const profile = await getProfile(req.userId!);
|
||||
sendRewardWithRetry(req.userId!, tipId, reward, candidate.features, profile, req.traceparent);
|
||||
}
|
||||
|
||||
// Delegate action to the owning signal source (e.g. mark done in Todoist)
|
||||
|
||||
@@ -8,6 +8,11 @@
|
||||
*/
|
||||
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
|
||||
|
||||
vi.mock('../../logger.js', () => ({
|
||||
logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), fatal: vi.fn() },
|
||||
}));
|
||||
import { logger } from '../../logger.js';
|
||||
|
||||
// ── mock the drizzle query chain: db.select(...).from(...).where(...) ────────
|
||||
let users: { userId: string }[] = [];
|
||||
const whereMock = vi.fn(async () => users);
|
||||
@@ -35,6 +40,7 @@ beforeEach(() => {
|
||||
whereMock.mockClear();
|
||||
fromMock.mockClear();
|
||||
selectMock.mockClear();
|
||||
vi.clearAllMocks();
|
||||
vi.useFakeTimers();
|
||||
});
|
||||
|
||||
@@ -102,8 +108,6 @@ describe('startTodoistSyncScheduler', () => {
|
||||
if (id === 'bad') throw new Error('todoist 401');
|
||||
return [];
|
||||
});
|
||||
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
|
||||
const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {});
|
||||
|
||||
startTodoistSyncScheduler(60_000);
|
||||
await vi.advanceTimersByTimeAsync(10_001);
|
||||
@@ -112,19 +116,27 @@ describe('startTodoistSyncScheduler', () => {
|
||||
await Promise.resolve();
|
||||
|
||||
expect(fetchSignalsMock).toHaveBeenCalledTimes(3);
|
||||
expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('sync error'), expect.anything());
|
||||
expect(logSpy).toHaveBeenCalledWith(expect.stringContaining('2 ok, 1 failed'));
|
||||
expect(logger.error).toHaveBeenCalledWith(
|
||||
expect.objectContaining({ err: expect.anything() }),
|
||||
'scheduler: sync error',
|
||||
);
|
||||
expect(logger.info).toHaveBeenCalledWith(
|
||||
expect.objectContaining({ ok: 2, failed: 1 }),
|
||||
'scheduler: todoist sync',
|
||||
);
|
||||
});
|
||||
|
||||
it('survives a db query failure — logs and skips the tick', async () => {
|
||||
const { startTodoistSyncScheduler } = await import('../scheduler.js');
|
||||
whereMock.mockRejectedValueOnce(new Error('sqlite locked'));
|
||||
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
|
||||
|
||||
startTodoistSyncScheduler(60_000);
|
||||
await vi.advanceTimersByTimeAsync(10_001);
|
||||
|
||||
expect(fetchSignalsMock).not.toHaveBeenCalled();
|
||||
expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('failed to query users'));
|
||||
expect(logger.error).toHaveBeenCalledWith(
|
||||
expect.objectContaining({ err: expect.anything() }),
|
||||
'scheduler: failed to query users',
|
||||
);
|
||||
});
|
||||
});
|
||||
|
||||
@@ -1,4 +1,5 @@
|
||||
import type { Signal, SignalSource } from '@oo/shared-types';
|
||||
import { logger } from '../logger.js';
|
||||
|
||||
/**
|
||||
* Merges signals from all registered sources for a user.
|
||||
@@ -24,7 +25,7 @@ export class SignalAggregator {
|
||||
if (r.status === 'fulfilled') {
|
||||
signals.push(...r.value);
|
||||
} else {
|
||||
console.error(`[aggregator] source '${this.sources[i].id}' failed:`, r.reason);
|
||||
logger.error({ sourceId: this.sources[i]!.id, err: r.reason }, 'aggregator: source failed');
|
||||
}
|
||||
}
|
||||
return signals;
|
||||
|
||||
@@ -13,6 +13,7 @@ import { db } from '../db/index.js';
|
||||
import { integrationTokens } from '../db/schema.js';
|
||||
import { eq } from 'drizzle-orm';
|
||||
import { todoistSource } from './todoist.js';
|
||||
import { logger } from '../logger.js';
|
||||
|
||||
const DEFAULT_INTERVAL_MS = 15 * 60 * 1000;
|
||||
|
||||
@@ -25,7 +26,7 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
|
||||
.from(integrationTokens)
|
||||
.where(eq(integrationTokens.tokenStatus, 'active'));
|
||||
} catch (err: any) {
|
||||
console.error(`[scheduler] failed to query users: ${err.message}`);
|
||||
logger.error({ err }, 'scheduler: failed to query users');
|
||||
return;
|
||||
}
|
||||
|
||||
@@ -39,10 +40,10 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
|
||||
let failed = 0;
|
||||
for (const r of results) {
|
||||
if (r.status === 'fulfilled') ok++;
|
||||
else { failed++; console.error(`[scheduler] sync error:`, r.reason); }
|
||||
else { failed++; logger.error({ err: r.reason }, 'scheduler: sync error'); }
|
||||
}
|
||||
|
||||
console.log(`[scheduler] todoist sync: ${ok} ok, ${failed} failed (${users.length} users)`);
|
||||
logger.info({ ok, failed, total: users.length }, 'scheduler: todoist sync');
|
||||
}
|
||||
|
||||
// Run once shortly after startup, then on interval
|
||||
|
||||
@@ -3,6 +3,7 @@ import { db } from '../db/index.js';
|
||||
import { integrationTokens } from '../db/schema.js';
|
||||
import { eq, and } from 'drizzle-orm';
|
||||
import { bus } from '../events/bus.js';
|
||||
import { logger } from '../logger.js';
|
||||
|
||||
const CACHE_TTL_MS = 30_000;
|
||||
|
||||
@@ -46,7 +47,7 @@ export class TodoistSignalSource implements SignalSource {
|
||||
|
||||
if (!res.ok) {
|
||||
if (res.status === 401) {
|
||||
console.error(`[todoist] token expired for user ${userId}`);
|
||||
logger.warn({ userId }, 'todoist: token expired');
|
||||
bus.publish('signals.integration.token_expired', {
|
||||
userId,
|
||||
provider: 'todoist',
|
||||
@@ -88,7 +89,7 @@ export class TodoistSignalSource implements SignalSource {
|
||||
});
|
||||
|
||||
this.cache.set(userId, { signals, fetchedAt: Date.now() });
|
||||
bus.publish('signals.task.synced', { userId, count: signals.length, syncedAt: now });
|
||||
bus.publish('signals.task.synced', { userId, source: 'todoist', count: signals.length, syncedAt: now });
|
||||
|
||||
return signals;
|
||||
}
|
||||
|
||||
@@ -2,30 +2,49 @@
|
||||
|
||||
Third-party connectors and the token vault.
|
||||
|
||||
## Connector interface
|
||||
## Signal source interface
|
||||
|
||||
Each connector implements `SignalSource` from `@oo/shared-types`:
|
||||
|
||||
```ts
|
||||
interface Connector {
|
||||
id: string // e.g. "todoist"
|
||||
scopes: string[] // human-readable list shown in consent UI
|
||||
beginOAuth(user): Promise<{ redirectUrl, state }>
|
||||
finishOAuth(code, state): Promise<StoredCredential>
|
||||
fetchSignals(user, since?): AsyncIterable<NormalizedEvent>
|
||||
// incremental-sync cursor (Todoist sync_token, webhook timestamps, etc.)
|
||||
// stored in Credential.meta; the connector owns its shape.
|
||||
act?(user, action): Promise<void> // optional write-back (complete task, etc.)
|
||||
revoke(user): Promise<void> // REQUIRED: provider-side token revocation on disconnect
|
||||
interface SignalSource {
|
||||
readonly id: string // e.g. "todoist"
|
||||
fetchSignals(userId: string): Promise<Signal[]> // returns normalized Signal[]
|
||||
act?(userId: string, signalId: string, action: string): Promise<void> // optional write-back
|
||||
}
|
||||
```
|
||||
|
||||
`SignalAggregator` (`services/api/src/signals/aggregator.ts`) fans out to all registered sources in parallel, isolating per-source failures.
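
The fan-out with per-source failure isolation can be sketched with `Promise.allSettled`. This is a minimal illustration, not the code in `aggregator.ts`: the `Signal` fields, the free-standing `fetchAll` function, and the `onError` callback are assumptions made so the example is self-contained.

```typescript
// Stand-ins for the types in @oo/shared-types (field names assumed for illustration).
interface Signal {
  id: string;
  source: string;
}

interface SignalSource {
  readonly id: string;
  fetchSignals(userId: string): Promise<Signal[]>;
}

// Query every source in parallel. A rejected source is reported through
// onError and skipped, so one failing connector cannot fail the whole call.
async function fetchAll(
  sources: SignalSource[],
  userId: string,
  onError: (sourceId: string, err: unknown) => void = () => {},
): Promise<Signal[]> {
  const results = await Promise.allSettled(sources.map((s) => s.fetchSignals(userId)));
  const signals: Signal[] = [];
  results.forEach((r, i) => {
    if (r.status === 'fulfilled') {
      signals.push(...r.value);
    } else {
      onError(sources[i]!.id, r.reason);
    }
  });
  return signals;
}
```

`Promise.allSettled` never rejects, which is why no `try`/`catch` is needed around the fan-out itself.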
|
||||
|
||||
## Token vault
|
||||
|
||||
- Credentials encrypted at rest (libsodium sealed box); key from env/KMS.
|
||||
- Refresh handled transparently; consumers never see raw tokens.
|
||||
- One row per `(user, provider)` with provider-specific `meta`.
|
||||
OAuth tokens stored in the `integration_tokens` SQLite table (`services/api/src/db/schema.ts`):
|
||||
|
||||
## Roadmap
|
||||
| Column | Description |
|
||||
|--------|-------------|
|
||||
| `userId` | owner |
|
||||
| `provider` | e.g. `todoist` |
|
||||
| `accessToken` | OAuth access token (plaintext in dev; encrypted in prod via the server secret store) |
|
||||
| `tokenStatus` | `active` \| `needs_reconnect` |
|
||||
|
||||
- Phase 0: **Todoist** (OAuth2, read tasks, complete task).
|
||||
- Phase 2: Google Calendar, Apple Health (web import), generic webhook ingress.
|
||||
- Phase 5: public SDK so third parties can ship connectors.
|
||||
On a 401 from the upstream API, the connector marks the token `needs_reconnect` and publishes `signals.integration.token_expired` so the client can prompt re-auth.
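
A hedged sketch of that 401 path follows. The `TokenStore` and `EventBus` interfaces and the `handleUpstreamStatus` helper are hypothetical seams introduced for illustration; only the event name and the `needs_reconnect` status come from the text above.

```typescript
type TokenStatus = 'active' | 'needs_reconnect';

// Hypothetical seams standing in for the integration_tokens table and the event bus.
interface TokenStore {
  setStatus(userId: string, provider: string, status: TokenStatus): void;
}

interface EventBus {
  publish(topic: string, payload: Record<string, unknown>): void;
}

// Returns false when the token was flagged for reconnect, true otherwise.
// Only a 401 is treated as a credential problem; other statuses are left to the caller.
function handleUpstreamStatus(
  status: number,
  userId: string,
  provider: string,
  store: TokenStore,
  bus: EventBus,
): boolean {
  if (status !== 401) return true;
  store.setStatus(userId, provider, 'needs_reconnect');
  bus.publish('signals.integration.token_expired', { userId, provider });
  return false;
}
```

Flagging the row before publishing means a subscriber that reads the table on receipt of the event already sees `needs_reconnect`.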
|
||||
|
||||
## Routes
|
||||
|
||||
| Method | Path | Description |
|
||||
|--------|------|-------------|
|
||||
| `GET` | `/api/integrations` | List connected integrations for current user |
|
||||
| `GET` | `/api/integrations/todoist/connect` | Start Todoist OAuth flow |
|
||||
| `GET` | `/api/integrations/todoist/callback` | OAuth callback — exchange code, store token |
|
||||
| `DELETE` | `/api/integrations/:provider` | Disconnect + delete token |
|
||||
|
||||
## Connectors
|
||||
|
||||
| Connector | Status | Signals produced |
|
||||
|-----------|--------|-----------------|
|
||||
| Todoist | Phase 1 — active | `task` signals (today + overdue); `done` write-back |
|
||||
| Google Calendar | Phase 2 — planned | `event` signals |
|
||||
|
||||
## Extraction criteria
|
||||
|
||||
Extract to its own process when credential blast-radius isolation requires it (e.g. token vault with KMS-backed encryption needs to run in a hardened sidecar) or when connector volume justifies separate scaling.
|
||||
|
||||
@@ -1,29 +1,42 @@
|
||||
# recommender
|
||||
|
||||
The core of oO. Takes a user + a context, returns **one** tip.
|
||||
The core of oO. Takes a user + context, returns **one** tip.
|
||||
|
||||
## Contract
|
||||
|
||||
```
|
||||
POST /recommend
|
||||
{ user_id, context?: { time, timezone, client, ... } }
|
||||
→ { tip: { id, kind: "todo"|"advice", title, body, source, deep_link, meta } }
|
||||
POST /api/recommend
|
||||
{ } (user inferred from session)
|
||||
→ { tip: { id, content, source, kind, sourceId?, rationale?, createdAt } }
|
||||
|
||||
POST /feedback
|
||||
{ user_id, tip_id, reaction: "done"|"snooze"|"dismiss", at }
|
||||
POST /api/tip/:id/feedback
|
||||
{ action: "done"|"dismiss"|"snooze"|"helpful"|"not_helpful", dwellMs? }
|
||||
→ { ok: true }
|
||||
```
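
To make the feedback body in the contract concrete, here is a small validator sketch. The action names and the optional `dwellMs` field are taken from the contract; the `parseFeedbackBody` helper and the rule that `dwellMs` must be a non-negative finite number are assumptions, not the route's actual validation.

```typescript
const FEEDBACK_ACTIONS = ['done', 'dismiss', 'snooze', 'helpful', 'not_helpful'] as const;
type FeedbackAction = (typeof FEEDBACK_ACTIONS)[number];

interface FeedbackBody {
  action: FeedbackAction;
  dwellMs?: number;
}

function isFeedbackAction(value: unknown): value is FeedbackAction {
  return typeof value === 'string' && (FEEDBACK_ACTIONS as readonly string[]).includes(value);
}

// Returns a typed body, or null when the input does not match the contract.
function parseFeedbackBody(input: unknown): FeedbackBody | null {
  if (typeof input !== 'object' || input === null) return null;
  const { action, dwellMs } = input as { action?: unknown; dwellMs?: unknown };
  if (!isFeedbackAction(action)) return null;
  if (dwellMs === undefined) return { action };
  // Assumed rule: dwell time is a non-negative, finite number of milliseconds.
  if (typeof dwellMs !== 'number' || !Number.isFinite(dwellMs) || dwellMs < 0) return null;
  return { action, dwellMs };
}
```

A handler could respond with a 400 when this returns `null` and otherwise pass the typed body on.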
|
||||
|
||||
## Internals (stable seams)
|
||||
## Pipeline
|
||||
|
||||
- **Candidate sources** — pluggable async generators. v0: Todoist tasks via `integrations`. Later: advice library, calendar nudges, health prompts.
|
||||
- **Feature assembler** — fills the `context` blob (inline in Phase 0; calls feature store from M1). Never inlined into policy code.
|
||||
- **Policy registry** — `Policy.pick(candidates, context) → tip`. Named entries:
|
||||
- `random` — v0 (Phase 0).
|
||||
- `bandit.linucb.pooled` — v1 (Phase 1). **Global-then-personalize**: pooled features shared across users; per-user residual once data allows.
|
||||
- `remote` — delegates to `ml/serving` FastAPI scorer (Phase 1+).
|
||||
- **Shadow hook** — every request optionally runs N shadow policies in parallel and logs their picks + estimated rewards. Promotion from shadow → A/B → launch is a separate, deliberate step (ADR-0002).
|
||||
- **TipInstance persistence** — every decision writes `context_snapshot` (features seen at decision time). This is what makes offline replay honest.
|
||||
1. **Signals** — `SignalAggregator.fetchAll(userId)` fans out to all registered `SignalSource` implementations in parallel. Currently: `TodoistSignalSource`. Add a source via `aggregator.register(new MySource())`.
|
||||
2. **LLM candidates** — `POST /generate` on `ml/serving` returns `TipCandidate[]` from the `tip-generator` LiteLLM alias.
|
||||
3. **Scoring** — all candidates are sent to the active policy on `ml/serving` (`POST /score/egreedy/v2`). Falls back to `random` if `ml/serving` is unreachable or returns a non-2xx response.
|
||||
4. **Shadow policies** — shadow policies run alongside the active policy in the same request for offline comparison (ADR-0002). Currently: `egreedy-v1` shadows the active `egreedy-v2`.
|
||||
5. **Persistence** — `tipViews` + `tipScores` rows written on every serve; `tipFeedback` row on reaction.
|
||||
6. **Reward delivery** — a reaction triggers `POST /reward/egreedy/v2` on `ml/serving` with the inferred reward value.
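
The scoring step with its random fallback can be sketched as follows. This is an illustration under assumptions: the `Scorer` type stands in for the HTTP call to `ml/serving`, and `pickTip` and the injectable `rng` are hypothetical names introduced to keep the example testable.

```typescript
interface TipCandidate {
  id: string;
}

interface Scored {
  tipId: string;
  policy: string;
}

// Stand-in for the remote scorer; resolves to null when the service has no usable answer.
type Scorer = (candidates: TipCandidate[]) => Promise<Scored | null>;

function randomPolicy(candidates: TipCandidate[], rng: () => number = Math.random): TipCandidate {
  return candidates[Math.floor(rng() * candidates.length)]!;
}

// Treat a thrown or rejected scorer the same as "no answer".
async function safeScore(scorer: Scorer, candidates: TipCandidate[]): Promise<Scored | null> {
  try {
    return await scorer(candidates);
  } catch {
    return null;
  }
}

// Use the scorer's pick when it names a known candidate; otherwise fall back to random.
async function pickTip(
  candidates: TipCandidate[],
  scorer: Scorer,
  rng: () => number = Math.random,
): Promise<{ tip: TipCandidate; policy: string }> {
  if (candidates.length === 0) throw new Error('no candidates to choose from');
  const scored = await safeScore(scorer, candidates);
  if (scored) {
    const match = candidates.find((c) => c.id === scored.tipId);
    if (match) return { tip: match, policy: scored.policy };
  }
  return { tip: randomPolicy(candidates, rng), policy: 'random' };
}
```

Recording which policy produced the tip, including the `random` fallback, keeps served decisions attributable when they are replayed offline.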
|
||||
|
||||
## Phase 0 goal
|
||||
## Signal normalization
|
||||
|
||||
`RandomPolicy` only. The service, contract, registry, shadow hook, and tip-instance persistence all exist; no ML yet.
|
||||
Signals carry `features: Record<string, number | boolean>` (bandit-ready) and `metadata: Record<string, unknown>` (source-specific raw fields). The bandit treats features as an opaque dict — sources own their feature names. See ADR-0009.
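
As an illustration of that split, the sketch below shows one possible signal and a helper that flattens the opaque feature dict into numeric pairs. The `id` and `source` fields, the feature names (`overdue`, `priority`), and `encodeFeatures` are assumptions for this example; only the `features` and `metadata` types come from the text above.

```typescript
interface Signal {
  id: string; // assumed field
  source: string; // assumed field
  features: Record<string, number | boolean>;
  metadata: Record<string, unknown>;
}

// Flatten a feature dict into [name, value] pairs with booleans encoded as 0/1.
// Keys are sorted so the same dict always yields the same order.
function encodeFeatures(features: Record<string, number | boolean>): Array<[string, number]> {
  return Object.keys(features)
    .sort()
    .map((name): [string, number] => {
      const value = features[name]!;
      return [name, typeof value === 'boolean' ? (value ? 1 : 0) : value];
    });
}

// Hypothetical Todoist-style signal: the feature names belong to the source, not the bandit.
const example: Signal = {
  id: 'todoist:123',
  source: 'todoist',
  features: { overdue: true, priority: 3 },
  metadata: { projectName: 'Inbox' },
};

const encoded = encodeFeatures(example.features);
```

Because the consumer never inspects key names, a new source can introduce its own features without a change on the scoring side.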
|
||||
|
||||
## Policy registry
|
||||
|
||||
| Policy | Status | Notes |
|
||||
|--------|--------|-------|
|
||||
| `random` | Fallback | Used when ml/serving is unreachable |
|
||||
| `egreedy-v1` | Shadow | d=7, ADR-0007 |
|
||||
| `egreedy-v2` | **Active** | d=12 + profile features, ADR-0012 |
|
||||
|
||||
Shadow → active promotion requires offline sim + online agreement (ADR-0002).
|
||||
|
||||
## Extraction criteria
|
||||
|
||||
Extract to its own process once it becomes a scaling hotspot: when `POST /api/recommend` p99 latency exceeds its SLA, or when recommendation CPU load displaces API serving on the shared host.
|
||||
|
||||