Compare commits
14 Commits
2d7cf217a9...e40dfdcbb0

| Author | SHA1 | Date |
|---|---|---|
| | e40dfdcbb0 | |
| | bad1bb2cba | |
| | e96ceb7ee1 | |
| | b554970032 | |
| | c4960d0601 | |
| | 7281af83a4 | |
| | cba3f1a184 | |
| | 352469162d | |
| | 45416000f9 | |
| | bd3ea1b8b1 | |
| | 377373a95d | |
| | d539fde0c1 | |
| | f48b5a7646 | |
| | 4652e4b582 | |
19
.dockerignore
Normal file
@@ -0,0 +1,19 @@
+**/node_modules
+**/.next
+**/dist
+**/coverage
+**/.vitest-cache
+**/.turbo
+.git
+.gitea
+.github
+.vscode
+.idea
+**/.env
+**/.env.local
+**/*.log
+docs
+infra/docker/data
+**/__tests__
+**/*.test.ts
+**/*.test.tsx
26
.env.example
@@ -10,6 +10,32 @@ API_BASE_URL=http://localhost:3078
 WEB_BASE_URL=http://localhost:3000
 ML_SERVING_URL=http://localhost:8000
 
+# MLflow (mlops profile) — http://localhost:5000/mlflow in dev, https://o.alogins.net/mlflow in prod.
+# MLFLOW_ADMIN_PASSWORD seeds the admin account on first boot (changing it after first run
+# requires the MLflow UI or API — see infra/mlflow/basic_auth.ini).
+MLFLOW_URL=http://localhost:5000
+MLFLOW_ADMIN_PASSWORD=change-me
+# Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
+NEXT_PUBLIC_MLFLOW_URL=http://localhost:5000
+
+# Airflow (mlops profile) — http://localhost:8080/airflow in dev.
+# Start with: docker compose --profile full --profile mlops up
+AIRFLOW_URL=http://localhost:8080
+AIRFLOW_ADMIN_PASSWORD=change-me
+AIRFLOW_DB_PASSWORD=airflow
+AIRFLOW_SECRET_KEY=change-me-in-prod
+AIRFLOW_FERNET_KEY=
+AIRFLOW_BASE_URL=https://o.alogins.net/airflow
+# Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
+NEXT_PUBLIC_AIRFLOW_URL=http://localhost:8080
+
+# Shared secret for Airflow→API internal callbacks. Generate: openssl rand -hex 32
+INTERNAL_API_TOKEN=
+
+# Static token for automated/service access to the admin panel (e.g. Playwright tests).
+# Leave empty to disable token-based login. Generate: openssl rand -hex 32
+ADMIN_TOKEN=
+
 # AI stack — shared Agap services (ollama + litellm + langfuse). Not run from oO.
 # Prod: https://llm.alogins.net | Dev: http://host.docker.internal:4000 from containers,
 # http://localhost:4000 from host. Ollama: http://host.docker.internal:11434 / :11434.
37
.gitea/workflows/buf-check.yaml
Normal file
@@ -0,0 +1,37 @@
+name: buf-check
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'packages/shared-types/events/**'
+  pull_request:
+    paths:
+      - 'packages/shared-types/events/**'
+
+jobs:
+  buf:
+    name: Lint & breaking-change check
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Install buf
+        run: |
+          BUF_VERSION=1.50.0
+          curl -sSfL \
+            "https://github.com/bufbuild/buf/releases/download/v${BUF_VERSION}/buf-Linux-x86_64" \
+            -o /usr/local/bin/buf
+          chmod +x /usr/local/bin/buf
+          buf --version
+
+      - name: buf lint
+        run: buf lint packages/shared-types/events
+
+      - name: buf breaking
+        if: github.event_name == 'pull_request'
+        run: |
+          buf breaking packages/shared-types/events \
+            --against ".git#branch=${{ github.base_ref }},subdir=packages/shared-types/events"
14
CLAUDE.md
@@ -56,7 +56,7 @@ docs/ architecture notes, ADRs, API specs
 ## Contracts between modules
 
 - **HTTP** (OpenAPI, in `packages/shared-types/http/`) — synchronous request/response. In-process today; over the network once extracted. Signatures are identical.
-- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Schema registry enforced in CI when #54 lands; until then payloads are JSON envelopes (ADR-0005).
+- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Proto schemas (ADR-0005) live in `packages/shared-types/events/oo/events/v1/`; `buf lint` + `buf breaking` run in CI on every PR touching those files (`.gitea/workflows/buf-check.yaml`).
 - Do not redefine types per module. Regenerate from `shared-types`.
 
 ## Conventions
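The Events contract in the hunk above describes an in-process bus that stays the source of truth, with an `onPublish` hook that mirrors each publish to JetStream. A minimal sketch of that shape follows. It is not the code in `services/api/src/events/bus.ts`: the `Envelope` type, the method names other than `onPublish`, and the ring size are assumptions made for illustration, and the bridge here is a plain callback rather than a real NATS client.

```typescript
// Illustrative sketch only; every name except `onPublish` is hypothetical.
type Envelope = { subject: string; payload: unknown; ts: number };
type Handler = (e: Envelope) => void;
type Bridge = (e: Envelope) => void;

class Bus {
  private handlers = new Map<string, Handler[]>();
  private bridges: Bridge[] = [];
  private ring: Envelope[] = [];
  private ringSize: number;

  constructor(ringSize = 100) {
    this.ringSize = ringSize;
  }

  // Register an in-process subscriber for one subject.
  subscribe(subject: string, handler: Handler): void {
    const list = this.handlers.get(subject) ?? [];
    list.push(handler);
    this.handlers.set(subject, list);
  }

  // Register a bridge that sees every publish (for example a JetStream mirror).
  onPublish(bridge: Bridge): void {
    this.bridges.push(bridge);
  }

  publish(subject: string, payload: unknown): Envelope {
    const e: Envelope = { subject, payload, ts: Date.now() };
    // 1. Ring buffer: bounded tail of recent events for an event viewer.
    this.ring.push(e);
    if (this.ring.length > this.ringSize) this.ring.shift();
    // 2. In-process subscribers for this subject.
    for (const h of this.handlers.get(subject) ?? []) h(e);
    // 3. Bridges: a failing mirror must not break in-process delivery.
    for (const b of this.bridges) {
      try {
        b(e);
      } catch {
        // Swallowed in this sketch; a real adapter would log and retry.
      }
    }
    return e;
  }

  // Most recent `n` events, oldest first.
  tail(n: number): Envelope[] {
    return this.ring.slice(-n);
  }
}
```

Because subscribers, the ring buffer, and the bridges are all fed from the single `publish` call, feature code that publishes through the bus cannot let them drift apart, which is the property the "Don't `nats.publish()` directly" rule protects.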
@@ -100,7 +100,7 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i
 
 **M1 shipped. M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`.
 
-Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
+Active work: bandit promotion (#99 — offline sim + ADR-0012 pending) and M2 issues (#61 freshness SLAs, #78 signal abstraction, #93 model benchmark).
 
 ## What NOT to do
 
@@ -112,3 +112,13 @@ Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
 - Don't call LLMs directly from application code. All LLM calls go through `ml/serving` (Python) via `LITELLM_URL`. The TS recommender never holds a model name.
 - Don't embed MLflow/Airflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `/airflow`, `ai.alogins.net`.
 - Don't `nats.publish()` directly from feature code. All publishes go through the in-process `Bus` (`services/api/src/events/bus.ts`); the NATS adapter (`events/nats.ts`) bridges every publish to JetStream when `NATS_URL` is set. This keeps subscribers, the ring-buffer tail used by the admin event viewer, and JetStream all in lockstep.
+
+## Admin app
+
+`apps/admin` rewrites `/api/*` → `$NEXT_PUBLIC_API_URL/api/*` via `next.config.ts`. So `apiFetch('/admin/stats')` in `apps/admin/src/lib/api.ts` hits the Express backend, not a Next.js route.
+
+Running `tsc --noEmit -p apps/admin/tsconfig.json` always reports `Cannot find module 'next'` errors — expected outside the Next.js build context; use `next build` for real type errors.
+
+## Auth / session pattern
+
+Sessions use an `sid` cookie. Admin routes stack `requireAuth` (sets `req.userId`) then `requireAdmin` (checks `role = 'admin'` in DB). Token-based admin auth: `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var sets the `sid` cookie — used by Playwright and CI.
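The "Admin app" note added in the hunk above says `/api/*` is rewritten to `$NEXT_PUBLIC_API_URL/api/*`. The actual `next.config.ts` is not part of this diff, so the following is only a sketch of how such a rule could be built. `apiRewrites` is a hypothetical helper, and the fallback URL is an assumption borrowed from `API_BASE_URL` in `.env.example`.

```typescript
// Hypothetical helper; the repository's next.config.ts may look different.
type Rewrite = { source: string; destination: string };

function apiRewrites(apiUrl: string | undefined): Rewrite[] {
  // Assumed fallback, taken from API_BASE_URL in .env.example.
  const base = (apiUrl ?? "http://localhost:3078").replace(/\/+$/, "");
  // `:path*` matches the rest of the path, so /api/admin/stats on the
  // admin origin is proxied to <base>/api/admin/stats on the Express API.
  return [{ source: "/api/:path*", destination: `${base}/api/:path*` }];
}

// In next.config.ts this would be wired in roughly as:
//   async rewrites() { return apiRewrites(process.env.NEXT_PUBLIC_API_URL); }
```

With a rule like this in place, browser code can keep using same-origin `/api/...` paths, which is why `apiFetch('/admin/stats')` reaches the backend without a Next.js route handler.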
@@ -8,6 +8,15 @@ Next.js 15 app. Deployed at `admin.o.alogins.net` (dev: `http://localhost:3080`)
   and checks `role === 'admin'`. First admin is seeded via `ADMIN_SEED_EMAIL` env var at API startup.
 - Admin write actions are appended to the `admin_actions` audit log in the DB.
 
+## Authentication
+
+Two ways to sign in:
+
+| Method | How |
+|--------|-----|
+| Google OAuth | Click "Sign in with Google" on the login page |
+| Token | `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var; sets `sid` cookie valid for 24 h. Used by Playwright tests and CI automation. |
+
 ## Pages
 
 | Route | Description |
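The Token row in the Authentication table above can be read as a small decision: compare the posted token with `ADMIN_TOKEN`, reject when they differ or when `ADMIN_TOKEN` is empty, otherwise set an `sid` cookie that lasts 24 h. The sketch below models only that decision as pure functions. The backend handler is not shown in this diff, so the function names, the 401 status, and the cookie attributes other than the name and lifetime are assumptions.

```typescript
// Illustrative sketch; not the Express handler from the repository.
const SESSION_TTL_SECONDS = 24 * 60 * 60; // "valid for 24 h" per the table

// Compares every character instead of returning early on the first mismatch.
// A production handler would more likely use Node's crypto.timingSafeEqual.
function tokensMatch(provided: string, expected: string): boolean {
  if (expected.length === 0) return false; // empty ADMIN_TOKEN disables token login
  let diff = provided.length ^ expected.length;
  for (let i = 0; i < expected.length; i++) {
    diff |= provided.charCodeAt(i) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}

// Cookie attributes beyond the name and Max-Age are assumptions.
function sessionCookie(sid: string): string {
  return `sid=${encodeURIComponent(sid)}; Max-Age=${SESSION_TTL_SECONDS}; Path=/; HttpOnly; SameSite=Lax`;
}

type LoginResult = { status: number; setCookie?: string; error?: string };

function handleTokenLogin(
  body: { token?: unknown },
  adminToken: string | undefined,
  newSessionId: () => string, // hypothetical session factory
): LoginResult {
  if (typeof body.token !== "string" || !tokensMatch(body.token, adminToken ?? "")) {
    return { status: 401, error: "Invalid token" };
  }
  return { status: 200, setCookie: sessionCookie(newSessionId()) };
}
```

A Playwright or CI client would then send `POST /api/auth/token` with the JSON body `{ "token": "<ADMIN_TOKEN>" }` and reuse the returned cookie on later requests.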
@@ -1,15 +1,67 @@
+'use client';
+
+import { useState } from 'react';
+import { useRouter } from 'next/navigation';
+
 export default function LoginPage() {
+  const router = useRouter();
+  const [token, setToken] = useState('');
+  const [error, setError] = useState('');
+  const [loading, setLoading] = useState(false);
+
+  async function handleTokenLogin(e: React.FormEvent) {
+    e.preventDefault();
+    setError('');
+    setLoading(true);
+    try {
+      const res = await fetch('/api/auth/token', {
+        method: 'POST',
+        credentials: 'include',
+        headers: { 'Content-Type': 'application/json' },
+        body: JSON.stringify({ token }),
+      });
+      if (!res.ok) {
+        const data = await res.json().catch(() => ({}));
+        setError((data as { error?: string }).error ?? 'Invalid token');
+        return;
+      }
+      router.push('/');
+    } catch {
+      setError('Request failed');
+    } finally {
+      setLoading(false);
+    }
+  }
+
   return (
     <div className="flex min-h-screen items-center justify-center">
-      <div className="text-center space-y-4">
+      <div className="text-center space-y-6 w-72">
         <h1 className="text-2xl font-semibold">oO Admin</h1>
-        <p className="text-gray-400 text-sm">Sign in via the main app first, then return here.</p>
+
         <a
           href="/sign-in"
           className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors"
         >
           Sign in with Google
         </a>
+
+        <form onSubmit={handleTokenLogin} className="space-y-3">
+          <input
+            type="password"
+            placeholder="Admin token"
+            value={token}
+            onChange={(e) => setToken(e.target.value)}
+            className="w-full px-3 py-2 bg-gray-900 border border-gray-700 rounded text-sm focus:outline-none focus:border-gray-500"
+          />
+          {error && <p className="text-red-400 text-xs">{error}</p>}
+          <button
+            type="submit"
+            disabled={loading || !token}
+            className="w-full px-4 py-2 bg-gray-700 text-white rounded text-sm font-medium hover:bg-gray-600 disabled:opacity-40 transition-colors"
+          >
+            {loading ? 'Signing in…' : 'Sign in with token'}
+          </button>
+        </form>
       </div>
     </div>
   );
220
apps/admin/src/app/simulate/page.tsx
Normal file
@@ -0,0 +1,220 @@
+'use client';
+
+import { useEffect, useState } from 'react';
+import { AdminShell } from '@/components/AdminShell';
+import {
+  startSimulation,
+  getSimulationRuns,
+  getSimulationRun,
+  SimRun,
+} from '@/lib/api';
+
+const POLICIES = ['linucb-v1', 'egreedy-v1', 'egreedy-v2'];
+const mlflowBase = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
+const airflowBase = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
+
+function mlflowRunUrl(runId: string) {
+  return `${mlflowBase}/#/experiments/1/runs/${runId}`;
+}
+
+function airflowRunUrl(dagRunId: string) {
+  return `${airflowBase}/dags/bandit_sim/grid?dag_run_id=${encodeURIComponent(dagRunId)}`;
+}
+
+function StatusBadge({ status }: { status: string }) {
+  const cls: Record<string, string> = {
+    running: 'bg-blue-900 text-blue-300 border-blue-800',
+    done: 'bg-green-900 text-green-300 border-green-800',
+    failed: 'bg-red-900 text-red-300 border-red-800',
+    pending: 'bg-gray-800 text-gray-400 border-gray-700',
+  };
+  return (
+    <span className={`text-xs px-2 py-0.5 rounded border ${cls[status] ?? cls.pending}`}>
+      {status}
+    </span>
+  );
+}
+
+function SummaryRow({ run }: { run: SimRun }) {
+  const summary = run.summaryJson ? JSON.parse(run.summaryJson) as Record<string, { total_reward: number; mean_reward: number; n_pulls: number }> : null;
+  return (
+    <div className="bg-gray-900 border border-gray-800 rounded p-4 space-y-2">
+      <div className="flex items-center justify-between">
+        <div className="space-y-0.5">
+          <div className="flex items-center gap-2">
+            <span className="font-mono text-xs text-gray-500">{run.id}</span>
+            <StatusBadge status={run.status} />
+            {run.winner && <span className="text-xs text-indigo-400">winner: {run.winner}</span>}
+          </div>
+          <div className="text-xs text-gray-600">
+            {run.nUsers}u × {run.nRounds}r × {run.tasksPerRound}t/r — {run.judgeMode} judge
+            {' · '}{new Date(run.createdAt).toLocaleString()}
+          </div>
+        </div>
+        <div className="flex items-center gap-2 flex-shrink-0">
+          {run.mlflowRunId && (
+            <a href={mlflowRunUrl(run.mlflowRunId)} target="_blank" rel="noreferrer"
+              className="text-xs text-indigo-400 hover:underline">MLflow ↗</a>
+          )}
+          {run.airflowDagRunId && (
+            <a href={airflowRunUrl(run.airflowDagRunId)} target="_blank" rel="noreferrer"
+              className="text-xs text-indigo-400 hover:underline">Airflow ↗</a>
+          )}
+        </div>
+      </div>
+      {summary && (
+        <div className="grid grid-cols-2 gap-2 pt-1 lg:grid-cols-3">
+          {Object.entries(summary).map(([policy, s]) => (
+            <div key={policy} className={`rounded border p-2 text-xs ${policy === run.winner ? 'border-indigo-700 bg-indigo-950' : 'border-gray-800'}`}>
+              <div className="font-mono font-medium text-gray-300 mb-1">{policy}</div>
+              <div className="text-gray-500 space-y-0.5">
+                <div>total <span className="text-gray-300">{s.total_reward.toFixed(2)}</span></div>
+                <div>mean <span className="text-gray-300">{s.mean_reward.toFixed(4)}</span></div>
+                <div>pulls <span className="text-gray-300">{s.n_pulls}</span></div>
+              </div>
+            </div>
+          ))}
+        </div>
+      )}
+    </div>
+  );
+}
+
+export default function SimulatePage() {
+  const [runs, setRuns] = useState<SimRun[]>([]);
+  const [loading, setLoading] = useState(true);
+  const [launching, setLaunching] = useState(false);
+  const [error, setError] = useState('');
+  const [msg, setMsg] = useState('');
+
+  const [nUsers, setNUsers] = useState(5);
+  const [nRounds, setNRounds] = useState(20);
+  const [tasksPerRound, setTasksPerRound] = useState(8);
+  const [judgeMode, setJudgeMode] = useState<'rule' | 'llm'>('rule');
+  const [selectedPolicies, setSelectedPolicies] = useState<string[]>(['linucb-v1', 'egreedy-v1']);
+
+  const refresh = () =>
+    getSimulationRuns()
+      .then((r) => setRuns(r.runs))
+      .catch((e) => setError(e.message))
+      .finally(() => setLoading(false));
+
+  useEffect(() => {
+    refresh();
+    const t = setInterval(refresh, 8_000);
+    return () => clearInterval(t);
+  }, []);
+
+  const togglePolicy = (p: string) =>
+    setSelectedPolicies((prev) =>
+      prev.includes(p) ? prev.filter((x) => x !== p) : [...prev, p],
+    );
+
+  const handleLaunch = async () => {
+    if (selectedPolicies.length < 2) { setError('Select at least 2 policies.'); return; }
+    setLaunching(true); setError(''); setMsg('');
+    try {
+      const r = await startSimulation({ nUsers, nRounds, tasksPerRound, judgeMode, policies: selectedPolicies });
+      setMsg(r.airflow_dag_run_id
+        ? `Launched via Airflow — dag_run_id: ${r.airflow_dag_run_id}`
+        : `Launched locally — run id: ${r.id}`);
+      await refresh();
+    } catch (e: unknown) {
+      setError((e as Error).message);
+    } finally {
+      setLaunching(false);
+    }
+  };
+
+  return (
+    <AdminShell>
+      <div className="space-y-8 max-w-4xl">
+        <h1 className="text-xl font-semibold">Simulations</h1>
+        {error && <p className="text-red-400 text-sm">{error}</p>}
+        {msg && <p className="text-green-400 text-sm">{msg}</p>}
+
+        {/* Launch form */}
+        <section className="bg-gray-900 border border-gray-800 rounded p-5 space-y-4">
+          <h2 className="text-base font-medium text-gray-300">New simulation</h2>
+
+          <div className="grid grid-cols-3 gap-4 text-sm">
+            <label className="space-y-1">
+              <span className="text-gray-500">Users</span>
+              <input type="number" min={1} max={50} value={nUsers}
+                onChange={(e) => setNUsers(Number(e.target.value))}
+                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
+            </label>
+            <label className="space-y-1">
+              <span className="text-gray-500">Rounds</span>
+              <input type="number" min={1} max={200} value={nRounds}
+                onChange={(e) => setNRounds(Number(e.target.value))}
+                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
+            </label>
+            <label className="space-y-1">
+              <span className="text-gray-500">Tasks/round</span>
+              <input type="number" min={1} max={20} value={tasksPerRound}
+                onChange={(e) => setTasksPerRound(Number(e.target.value))}
+                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
+            </label>
+          </div>
+
+          <div className="space-y-1 text-sm">
+            <span className="text-gray-500">Policies (select ≥ 2)</span>
+            <div className="flex gap-2 flex-wrap pt-1">
+              {POLICIES.map((p) => (
+                <button key={p} onClick={() => togglePolicy(p)}
+                  className={`px-3 py-1 rounded border text-xs font-mono ${
+                    selectedPolicies.includes(p)
+                      ? 'bg-indigo-900 border-indigo-700 text-indigo-200'
+                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
+                  }`}>
+                  {p}
+                </button>
+              ))}
+            </div>
+          </div>
+
+          <div className="space-y-1 text-sm">
+            <span className="text-gray-500">Judge</span>
+            <div className="flex gap-2 pt-1">
+              {(['rule', 'llm'] as const).map((m) => (
+                <button key={m} onClick={() => setJudgeMode(m)}
+                  className={`px-3 py-1 rounded border text-xs ${
+                    judgeMode === m
+                      ? 'bg-gray-700 border-gray-500 text-white'
+                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
+                  }`}>
+                  {m}
+                </button>
+              ))}
+            </div>
+            {judgeMode === 'llm' && (
+              <p className="text-xs text-yellow-600 mt-1">LLM judge requires ANTHROPIC_API_KEY in ml/serving env.</p>
+            )}
+          </div>
+
+          <button onClick={handleLaunch} disabled={launching}
+            className="bg-indigo-600 hover:bg-indigo-500 disabled:opacity-50 text-white rounded px-4 py-2 text-sm">
+            {launching ? 'Launching…' : 'Launch simulation'}
+          </button>
+          <p className="text-xs text-gray-600">
+            Runs via <a href={airflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">Airflow</a> (mlops profile) when available; falls back to local subprocess.
+            Results logged to <a href={mlflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">MLflow</a>.
+          </p>
+        </section>
+
+        {/* Run history */}
+        <section className="space-y-3">
+          <h2 className="text-base font-medium text-gray-300">
+            Run history
+            {loading && <span className="text-xs text-gray-600 ml-2">loading…</span>}
+          </h2>
+          {runs.length === 0 && !loading && (
+            <p className="text-gray-600 text-sm">No simulations yet.</p>
+          )}
+          {runs.map((r) => <SummaryRow key={r.id} run={r} />)}
+        </section>
+      </div>
+    </AdminShell>
+  );
+}
@@ -2,6 +2,7 @@
 
 import Link from 'next/link';
 import { usePathname } from 'next/navigation';
+import { useEffect, useState } from 'react';
 
 const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
 const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
@@ -10,6 +11,7 @@ type NavItem = {
   href: string;
   label: string;
   external?: boolean;
+  svcName?: string; // key in the health services map
 };
 
 type NavSection = {
@@ -31,10 +33,11 @@ const NAV: NavSection[] = [
     ],
   },
   {
-    label: 'Recommender status',
+    label: 'Recommender',
     items: [
       { href: '/tips', label: 'Tips' },
       { href: '/reward-analytics', label: 'Rewards' },
+      { href: '/simulate', label: 'Simulations' },
     ],
   },
   {
@@ -50,14 +53,33 @@ const NAV: NavSection[] = [
     label: 'Resources',
     items: [
       { href: '/docs', label: 'Docs' },
-      { href: mlflowUrl, label: 'MLflow ↗', external: true },
-      { href: airflowUrl, label: 'Airflow ↗', external: true },
+      { href: mlflowUrl, label: 'MLflow ↗', external: true, svcName: 'mlflow' },
+      { href: airflowUrl, label: 'Airflow ↗', external: true, svcName: 'airflow' },
     ],
   },
 ];
 
+const STATUS_DOT: Record<string, string> = {
+  ok: 'bg-green-500',
+  degraded: 'bg-yellow-400',
+  down: 'bg-red-500',
+};
+
 export function AdminShell({ children }: { children: React.ReactNode }) {
   const pathname = usePathname();
+  const [svcStatus, setSvcStatus] = useState<Record<string, string>>({});
+
+  useEffect(() => {
+    fetch('/api/admin/health', { credentials: 'include' })
+      .then((r) => r.json())
+      .then((data: { services?: { name: string; status: string }[] }) => {
+        const map: Record<string, string> = {};
+        for (const s of data.services ?? []) map[s.name] = s.status;
+        setSvcStatus(map);
+      })
+      .catch(() => {});
+  }, []);
+
   return (
     <div className="flex min-h-screen">
       {/* Sidebar */}
@@ -83,13 +105,19 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
             const active =
               !item.external &&
               (item.href === '/' ? pathname === '/' : pathname.startsWith(item.href));
-            const className = `flex items-center px-3 py-2 rounded text-sm transition-colors ${
+            const className = `flex items-center gap-2 px-3 py-2 rounded text-sm transition-colors ${
               active
                 ? 'bg-gray-800 text-white font-medium'
                 : item.external
                   ? 'text-gray-500 hover:text-white hover:bg-gray-900'
                   : 'text-gray-400 hover:text-white hover:bg-gray-900'
             }`;
+            const dot = item.svcName
+              ? svcStatus[item.svcName]
+                ? <span className={`inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 ${STATUS_DOT[svcStatus[item.svcName]] ?? STATUS_DOT.down}`} />
+                : <span className="inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 bg-gray-700" />
+              : null;
+
             return item.external ? (
               <a
                 key={item.href}
@@ -98,6 +126,7 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
                 rel="noreferrer"
                 className={className}
               >
+                {dot}
                 {item.label}
               </a>
             ) : (
@@ -262,3 +262,49 @@ export function saveQuery(name: string, querySql: string) {
 export function deleteSavedQuery(id: string) {
   return apiFetch<{ ok: boolean }>(`/admin/saved-queries/${id}`, { method: 'DELETE' });
 }
+
+// ── Simulations ────────────────────────────────────────────────────────────
+
+export interface SimRun {
+  id: string;
+  policyA: string;
+  policyB: string;
+  nUsers: number;
+  nRounds: number;
+  tasksPerRound: number;
+  judgeMode: string;
+  nPolicies: number;
+  status: 'pending' | 'running' | 'done' | 'failed';
+  summaryJson: string | null;
+  winner: string | null;
+  personaBreakdownJson: string | null;
+  airflowDagRunId: string | null;
+  mlflowRunId: string | null;
+  createdAt: string;
+  finishedAt: string | null;
+}
+
+export interface SimStartRequest {
+  nUsers?: number;
+  nRounds?: number;
+  tasksPerRound?: number;
+  judgeMode?: 'rule' | 'llm';
+  policies?: string[];
+}
+
+export function startSimulation(req: SimStartRequest) {
+  return apiFetch<{ id: string; status: string; airflow_dag_run_id?: string }>(
+    '/admin/simulate/start',
+    { method: 'POST', body: JSON.stringify(req) },
+  );
+}
+
+export function getSimulationRuns() {
+  return apiFetch<{ runs: SimRun[] }>('/admin/simulate/runs');
+}
+
+export function getSimulationRun(id: string) {
+  return apiFetch<{ run: SimRun & { isRunning: boolean }; events: unknown[] }>(
+    `/admin/simulate/${id}`,
+  );
+}
File diff suppressed because one or more lines are too long
@@ -1,7 +1,7 @@
 # ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)
 
-**Status:** Accepted
-**Date:** 2026-04-25
+**Status:** Promoted
+**Date:** 2026-04-25 (accepted) / 2026-04-26 (promoted)
 **Issue:** #99
 
 ## Context
@@ -106,3 +106,19 @@ projecting theta without the corresponding `A` matrix cannot be done correctly.
|
|||||||
the D=12 target in the issue spec and complicates the sim comparison. Deferred.
|
the D=12 target in the issue spec and complicates the sim comparison. Deferred.
|
||||||
|
|
||||||
**In-place v1 promotion without shadow** — violates ADR-0002.
|
**In-place v1 promotion without shadow** — violates ADR-0002.
|
||||||
|
|
||||||
|
## Promotion record (2026-04-26)
|
||||||
|
|
||||||
|
Offline sim (`runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42`):
|
||||||
|
|
||||||
|
| policy | total reward | mean reward | pulls |
|
||||||
|
|--------|-------------|-------------|-------|
|
||||||
|
| egreedy-v1 | −64.20 | −0.6420 | 100 |
|
||||||
|
| egreedy-v2 | −62.90 | −0.6290 | 100 |
|
||||||
|
|
||||||
|
**Gate passed** (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.
|
||||||
|
|
||||||
|
Changes applied:
|
||||||
|
- `recommender.ts` `remotePolicy()`: `/score/egreedy` → `/score/egreedy/v2`
|
||||||
|
- `recommender.ts` `sendRewardWithRetry()`: `/reward/egreedy` → `/reward/egreedy/v2`, added `profile_features` to payload
|
||||||
|
- Shadow entry `egreedy-v2-shadow` left in registry (`active: false`) for rollback.
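
The gate in this promotion record reduces to a comparison of mean rewards. A minimal sketch of that check, assuming per-policy summaries with `total_reward` and `n_pulls` keys like the ones `runner.py` reports; the helper names `mean_reward` and `gate_passes` are hypothetical and not part of the repo:

```python
def mean_reward(total_reward: float, n_pulls: int) -> float:
    """Mean reward per pull; 0.0 when the policy was never pulled."""
    return total_reward / n_pulls if n_pulls else 0.0


def gate_passes(candidate: dict, baseline: dict) -> bool:
    """Hypothetical gate: candidate mean reward must be >= baseline mean."""
    return (
        mean_reward(candidate["total_reward"], candidate["n_pulls"])
        >= mean_reward(baseline["total_reward"], baseline["n_pulls"])
    )


# Numbers taken from the promotion record table.
v1 = {"total_reward": -64.20, "n_pulls": 100}
v2 = {"total_reward": -62.90, "n_pulls": 100}
print(gate_passes(candidate=v2, baseline=v1))
```

The sketch checks only the stated criterion. The record covers a single seed with 100 pulls per policy, so it says nothing about run-to-run variance.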
|
||||||
|
|||||||
@@ -1,21 +1,22 @@
|
|||||||
FROM node:22-alpine AS base
|
# syntax=docker/dockerfile:1.7
|
||||||
RUN npm install -g pnpm
|
|
||||||
|
|
||||||
FROM base AS deps
|
FROM node:22-slim AS base
|
||||||
WORKDIR /app
|
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
|
||||||
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
|
&& rm -rf /var/lib/apt/lists/* \
|
||||||
COPY packages/shared-types/package.json ./packages/shared-types/
|
&& npm install -g pnpm
|
||||||
COPY apps/admin/package.json ./apps/admin/
|
ENV CI=true \
|
||||||
RUN pnpm install --frozen-lockfile
|
PNPM_HOME=/pnpm \
|
||||||
|
PATH=/pnpm:$PATH
|
||||||
|
RUN pnpm config set store-dir /pnpm/store
|
||||||
|
|
||||||
FROM base AS builder
|
FROM base AS builder
|
||||||
WORKDIR /app
|
WORKDIR /app
|
||||||
COPY --from=deps /app/node_modules ./node_modules
|
COPY pnpm-lock.yaml ./
|
||||||
COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
|
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
|
||||||
COPY --from=deps /app/apps/admin/node_modules ./apps/admin/node_modules
|
COPY . .
|
||||||
COPY tsconfig.base.json ./
|
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
|
||||||
COPY packages/shared-types ./packages/shared-types
|
pnpm install --frozen-lockfile --offline \
|
||||||
COPY apps/admin ./apps/admin
|
--filter @oo/admin... --filter @oo/shared-types
|
||||||
RUN pnpm --filter @oo/shared-types build
|
RUN pnpm --filter @oo/shared-types build
|
||||||
ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow
|
ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow
|
||||||
ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow
|
ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow
|
||||||
@@ -24,7 +25,7 @@ ENV NEXT_TELEMETRY_DISABLED=1 \
|
|||||||
NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL
|
NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL
|
||||||
RUN pnpm --filter @oo/admin build
|
RUN pnpm --filter @oo/admin build
|
||||||
|
|
||||||
FROM node:22-alpine AS runner
|
FROM node:22-slim AS runner
|
||||||
ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080
|
ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080
|
||||||
WORKDIR /app
|
WORKDIR /app
|
||||||
COPY --from=builder /app/apps/admin/.next/standalone ./
|
COPY --from=builder /app/apps/admin/.next/standalone ./
|
||||||
|
|||||||
@@ -1,32 +1,35 @@
|
|||||||
FROM node:22-alpine AS base
|
# syntax=docker/dockerfile:1.7
|
||||||
RUN npm install -g pnpm
|
|
||||||
|
|
||||||
FROM base AS deps
|
FROM node:22-slim AS base
|
||||||
WORKDIR /app
|
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||||
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
|
python3 make g++ ca-certificates \
|
||||||
COPY packages/shared-types/package.json ./packages/shared-types/
|
&& rm -rf /var/lib/apt/lists/* \
|
||||||
COPY services/api/package.json ./services/api/
|
&& npm install -g pnpm
|
||||||
RUN pnpm install --frozen-lockfile
|
ENV CI=true \
|
||||||
|
PNPM_HOME=/pnpm \
|
||||||
|
PATH=/pnpm:$PATH
|
||||||
|
RUN pnpm config set store-dir /pnpm/store
|
||||||
|
|
||||||
FROM base AS builder
|
FROM base AS builder
|
||||||
WORKDIR /app
|
WORKDIR /app
|
||||||
COPY --from=deps /app/node_modules ./node_modules
|
COPY pnpm-lock.yaml ./
|
||||||
COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
|
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
|
||||||
COPY --from=deps /app/services/api/node_modules ./services/api/node_modules
|
COPY . .
|
||||||
COPY tsconfig.base.json ./
|
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
|
||||||
COPY packages/shared-types ./packages/shared-types
|
pnpm install --frozen-lockfile --offline \
|
||||||
COPY services/api ./services/api
|
--filter @oo/api... --filter @oo/shared-types
|
||||||
RUN pnpm --filter @oo/shared-types build
|
RUN pnpm --filter @oo/shared-types build
|
||||||
RUN pnpm --filter @oo/api build
|
RUN pnpm --filter @oo/api build
|
||||||
|
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
|
||||||
|
pnpm --filter @oo/api --prod deploy --legacy /deploy \
|
||||||
|
&& cp -r services/api/dist /deploy/dist \
|
||||||
|
&& rm -rf /deploy/node_modules/@oo/shared-types/src \
|
||||||
|
&& cp -r packages/shared-types/dist /deploy/node_modules/@oo/shared-types/dist
|
||||||
|
|
||||||
FROM node:22-alpine AS runner
|
FROM node:22-slim AS runner
|
||||||
WORKDIR /app
|
WORKDIR /app
|
||||||
RUN npm install -g pnpm
|
ENV NODE_ENV=production
|
||||||
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
|
COPY --from=builder /deploy/package.json ./
|
||||||
COPY packages/shared-types/package.json ./packages/shared-types/
|
COPY --from=builder /deploy/node_modules ./node_modules
|
||||||
COPY services/api/package.json ./services/api/
|
COPY --from=builder /deploy/dist ./dist
|
||||||
RUN pnpm install --prod --frozen-lockfile
|
|
||||||
COPY --from=builder /app/packages/shared-types/dist ./packages/shared-types/dist
|
|
||||||
COPY --from=builder /app/services/api/dist ./services/api/dist
|
|
||||||
WORKDIR /app/services/api
|
|
||||||
CMD ["node", "dist/index.js"]
|
CMD ["node", "dist/index.js"]
|
||||||
|
|||||||
@@ -2,5 +2,5 @@ FROM python:3.12-slim
|
|||||||
WORKDIR /app
|
WORKDIR /app
|
||||||
COPY ml/serving/requirements.txt .
|
COPY ml/serving/requirements.txt .
|
||||||
RUN pip install --no-cache-dir -r requirements.txt
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
COPY ml/serving/main.py .
|
COPY ml/serving/*.py .
|
||||||
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
|
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
|
||||||
|
|||||||
@@ -11,12 +11,18 @@ services:
|
|||||||
env_file: ../../.env.local
|
env_file: ../../.env.local
|
||||||
environment:
|
environment:
|
||||||
NODE_ENV: production
|
NODE_ENV: production
|
||||||
|
ML_SERVING_URL: "http://ml-serving:8000"
|
||||||
|
MLFLOW_URL: "http://mlflow:5000"
|
||||||
|
AIRFLOW_URL: "http://airflow-webserver:8080"
|
||||||
|
AIRFLOW_API_USER: "admin"
|
||||||
|
AIRFLOW_API_PASSWORD: "${AIRFLOW_ADMIN_PASSWORD:-admin}"
|
||||||
|
INTERNAL_API_TOKEN: "${INTERNAL_API_TOKEN:-}"
|
||||||
volumes:
|
volumes:
|
||||||
- /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo
|
- /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo
|
||||||
ports:
|
ports:
|
||||||
- "127.0.0.1:3078:3078"
|
- "127.0.0.1:3078:3078"
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3078/health"]
|
test: ["CMD", "node", "-e", "fetch('http://localhost:3078/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"]
|
||||||
interval: 10s
|
interval: 10s
|
||||||
timeout: 5s
|
timeout: 5s
|
||||||
retries: 5
|
retries: 5
|
||||||
@@ -49,6 +55,8 @@ services:
|
|||||||
PORT: "3080"
|
PORT: "3080"
|
||||||
HOSTNAME: "0.0.0.0"
|
HOSTNAME: "0.0.0.0"
|
||||||
NEXT_PUBLIC_API_URL: ""
|
NEXT_PUBLIC_API_URL: ""
|
||||||
|
NEXT_PUBLIC_MLFLOW_URL: "/mlflow"
|
||||||
|
NEXT_PUBLIC_AIRFLOW_URL: "/airflow"
|
||||||
INTERNAL_API_URL: "http://api:3078"
|
INTERNAL_API_URL: "http://api:3078"
|
||||||
ports:
|
ports:
|
||||||
- "127.0.0.1:3080:3080"
|
- "127.0.0.1:3080:3080"
|
||||||
@@ -133,8 +141,14 @@ services:
|
|||||||
AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
|
AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
|
||||||
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
|
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
|
||||||
AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
|
AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
|
||||||
|
AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth"
|
||||||
|
_PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
|
||||||
|
MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
|
||||||
|
MLFLOW_TRACKING_USERNAME: "admin"
|
||||||
|
MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
|
||||||
volumes:
|
volumes:
|
||||||
- ../../ml/pipelines:/opt/airflow/dags:ro
|
- ../../ml/pipelines:/opt/airflow/dags:ro
|
||||||
|
- ../../ml:/opt/airflow/ml:ro
|
||||||
ports:
|
ports:
|
||||||
- "127.0.0.1:8080:8080"
|
- "127.0.0.1:8080:8080"
|
||||||
depends_on:
|
depends_on:
|
||||||
@@ -155,8 +169,13 @@ services:
|
|||||||
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
|
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
|
||||||
AIRFLOW__CORE__EXECUTOR: LocalExecutor
|
AIRFLOW__CORE__EXECUTOR: LocalExecutor
|
||||||
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
|
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
|
||||||
|
_PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
|
||||||
|
MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
|
||||||
|
MLFLOW_TRACKING_USERNAME: "admin"
|
||||||
|
MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
|
||||||
volumes:
|
volumes:
|
||||||
- ../../ml/pipelines:/opt/airflow/dags:ro
|
- ../../ml/pipelines:/opt/airflow/dags:ro
|
||||||
|
- ../../ml:/opt/airflow/ml:ro
|
||||||
depends_on:
|
depends_on:
|
||||||
airflow-init:
|
airflow-init:
|
||||||
condition: service_completed_successfully
|
condition: service_completed_successfully
|
||||||
|
|||||||
24
ml/README.md
@@ -4,8 +4,8 @@ Python. Owns models, features, training, online scoring.
|
|||||||
|
|
||||||
| Dir | Role | Phase |
|
| Dir | Role | Phase |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`), called by `recommender` | 1–2 |
|
| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`) + JetStream consumers for `signals.>` / `feedback.>`, called by `recommender` | 1–2 |
|
||||||
| `features/` | context assembler (`context.py`): signals → `PromptContext`; Feast adapter later | 2 |
|
| `features/` | context assembler (`context.py`): signals → `PromptContext`; profile-feature schema mirror (`profile_schema.py`); Feast adapter later | 2 |
|
||||||
| `pipelines/` | batch feature + training DAGs (Prefect/Airflow) | 4 |
|
| `pipelines/` | batch feature + training DAGs (Prefect/Airflow) | 4 |
|
||||||
| `registry/` | MLflow-backed model registry integration | 4 |
|
| `registry/` | MLflow-backed model registry integration | 4 |
|
||||||
| `experiments/` | A/B assignment + multi-armed bandit policies | 4 |
|
| `experiments/` | A/B assignment + multi-armed bandit policies | 4 |
|
||||||
@@ -18,14 +18,24 @@ Python. Owns models, features, training, online scoring.
|
|||||||
- Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew).
|
- Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew).
|
||||||
- Shadow deploys before any policy change that affects real users.
|
- Shadow deploys before any policy change that affects real users.
|
||||||
|
|
||||||
## Profile-feature contract
|
## Feature contract
|
||||||
|
|
||||||
|
### Profile features (batched)
|
||||||
|
|
||||||
User-level features (completion rate, preferred hour, tip volume…) are computed
|
User-level features (completion rate, preferred hour, tip volume…) are computed
|
||||||
by the TypeScript recommender and shipped to ml/serving on every `/score` and
|
by the TypeScript recommender and shipped to `ml/serving` on every `/score` and
|
||||||
`/generate` call as `profile_features: dict | None`. The Python mirror in
|
`/generate` call as `profile_features: dict | None`. The Python mirror in
|
||||||
`features/profile_schema.py` documents the available names + dtypes — keep it
|
`features/profile_schema.py` documents each feature's name, dtype, TTL, source,
|
||||||
in sync with `services/api/src/profile/registry.ts` (a CI-style test asserts
|
and null fallback — keep it in sync with `services/api/src/profile/registry.ts`
|
||||||
the name sets match). See ADR-0011.
|
(a CI-style test asserts names and `ttlSec` values match). See ADR-0011.
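
To illustrate what a per-feature TTL means for a batched value, here is a minimal sketch. It assumes the profile store records a `computed_at` Unix timestamp per feature; that field and the `is_stale` helper are hypothetical. The TTL values are the ones declared in `features/profile_schema.py`.

```python
from __future__ import annotations

import time

HOUR = 3600
DAY = 86_400

# TTLs as declared in features/profile_schema.py.
TTL_SEC = {
    "completion_rate_30d": 6 * HOUR,
    "dismiss_rate_30d": 6 * HOUR,
    "mean_dwell_ms_30d": 6 * HOUR,
    "preferred_hour": DAY,
    "tip_volume_30d": HOUR,
}


def is_stale(name: str, computed_at: float, now: float | None = None) -> bool:
    """True when the cached value is older than its declared TTL.

    `computed_at` is an assumed Unix timestamp (seconds) of the last
    recompute; `now` defaults to the current time.
    """
    current = time.time() if now is None else now
    return (current - computed_at) > TTL_SEC[name]


# A tip-volume value computed two hours ago is past its one-hour TTL.
print(is_stale("tip_volume_30d", computed_at=0.0, now=2 * HOUR))
```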
|
||||||
|
|
||||||
|
### Context features (JIT)
|
||||||
|
|
||||||
|
Request-time signals assembled by `features/context.py` (`hour_of_day`,
|
||||||
|
`day_of_week`, task list). These are never cached — they are derived from the
|
||||||
|
system clock and the live Todoist feed at the moment of the score call.
|
||||||
|
`CONTEXT_FEATURES` in `context.py` declares freshness, source, and fallback for
|
||||||
|
each field (issue #61).
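
A minimal sketch of how a caller might apply the declared fallbacks when a request omits a context field. The fallback strings are copied from `CONTEXT_FEATURES`; treating them as JSON literals is an assumption that happens to hold for these three fields, and `resolve_context` is a hypothetical helper, not a function in `context.py`.

```python
from __future__ import annotations

import json
from typing import Any

# Fallback strings as declared in CONTEXT_FEATURES (features/context.py).
FALLBACKS = {"hour_of_day": "12", "day_of_week": "0", "tasks": "[]"}


def resolve_context(request: dict[str, Any]) -> dict[str, Any]:
    """Fill missing or null request fields from the declared fallbacks.

    Assumes each fallback string parses as a JSON literal. Values that are
    present are kept as-is, including falsy ones such as hour 0.
    """
    resolved: dict[str, Any] = {}
    for name, fallback in FALLBACKS.items():
        value = request.get(name)
        resolved[name] = json.loads(fallback) if value is None else value
    return resolved


print(resolve_context({"hour_of_day": 9}))
```

Checking `value is None` rather than truthiness matters here: midnight (`hour_of_day=0`) and Monday (`day_of_week=0`) are valid inputs and should not be replaced by the fallback.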
|
||||||
|
|
||||||
## Prompt registry
|
## Prompt registry
|
||||||
|
|
||||||
|
|||||||
@@ -26,6 +26,7 @@ from __future__ import annotations
|
|||||||
|
|
||||||
import argparse
|
import argparse
|
||||||
import json
|
import json
|
||||||
|
import os
|
||||||
import random
|
import random
|
||||||
import sys
|
import sys
|
||||||
import time
|
import time
|
||||||
@@ -40,6 +41,12 @@ from llm_judge import ACTIONS, infer_reward, judge
|
|||||||
from personas import PERSONAS, Persona
|
from personas import PERSONAS, Persona
|
||||||
from task_generator import generate_task_pool
|
from task_generator import generate_task_pool
|
||||||
|
|
||||||
|
try:
|
||||||
|
import mlflow
|
||||||
|
_MLFLOW_AVAILABLE = True
|
||||||
|
except ImportError:
|
||||||
|
_MLFLOW_AVAILABLE = False
|
||||||
|
|
||||||
POLICY_SCORE_ENDPOINTS: dict[str, str] = {
|
POLICY_SCORE_ENDPOINTS: dict[str, str] = {
|
||||||
"linucb-v1": "/score",
|
"linucb-v1": "/score",
|
||||||
"egreedy-v1": "/score/egreedy",
|
"egreedy-v1": "/score/egreedy",
|
||||||
@@ -107,14 +114,30 @@ def _call_reward(
|
|||||||
|
|
||||||
# ── Standard single-pass runner (rule / llm modes) ─────────────────────────
|
# ── Standard single-pass runner (rule / llm modes) ─────────────────────────
|
||||||
|
|
||||||
|
def _init_mlflow(mlflow_url: str | None, experiment: str) -> str | None:
|
||||||
|
"""Configure MLflow tracking; return "ready" on success, or None if MLflow is unavailable or setup fails."""
|
||||||
|
if not _MLFLOW_AVAILABLE or not mlflow_url:
|
||||||
|
return None
|
||||||
|
try:
|
||||||
|
mlflow.set_tracking_uri(mlflow_url)
|
||||||
|
mlflow.set_experiment(experiment)
|
||||||
|
return "ready"
|
||||||
|
except Exception as e:
|
||||||
|
print(f" [warn] MLflow init failed: {e}", file=sys.stderr)
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
def run_simulation(
|
def run_simulation(
|
||||||
n_users: int, n_rounds: int, tasks_per_round: int,
|
n_users: int, n_rounds: int, tasks_per_round: int,
|
||||||
ml_url: str, policies: list[str], use_llm: bool, seed: int,
|
ml_url: str, policies: list[str], use_llm: bool, seed: int,
|
||||||
|
mlflow_url: str | None = None, mlflow_experiment: str = "bandit_simulation",
|
||||||
) -> dict:
|
) -> dict:
|
||||||
rng = random.Random(seed)
|
rng = random.Random(seed)
|
||||||
run_id = str(uuid.uuid4())[:8]
|
run_id = str(uuid.uuid4())[:8]
|
||||||
started_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
started_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||||
|
|
||||||
|
_init_mlflow(mlflow_url, mlflow_experiment)
|
||||||
|
|
||||||
user_personas = [
|
user_personas = [
|
||||||
(f"sim-{run_id}-u{i}", PERSONAS[i % len(PERSONAS)])
|
(f"sim-{run_id}-u{i}", PERSONAS[i % len(PERSONAS)])
|
||||||
for i in range(n_users)
|
for i in range(n_users)
|
||||||
@@ -130,6 +153,26 @@ def run_simulation(
|
|||||||
}
|
}
|
||||||
events: list[dict] = []
|
events: list[dict] = []
|
||||||
|
|
||||||
|
mlflow_run_id: str | None = None
|
||||||
|
mlflow_ctx = (
|
||||||
|
mlflow.start_run(run_name=run_id)
|
||||||
|
if (_MLFLOW_AVAILABLE and mlflow_url)
|
||||||
|
else None
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
if mlflow_ctx:
|
||||||
|
active = mlflow_ctx.__enter__()
|
||||||
|
mlflow_run_id = active.info.run_id
|
||||||
|
mlflow.log_params({
|
||||||
|
"n_users": n_users,
|
||||||
|
"n_rounds": n_rounds,
|
||||||
|
"tasks_per_round": tasks_per_round,
|
||||||
|
"policies": ",".join(policies),
|
||||||
|
"judge": "llm" if use_llm else "rule",
|
||||||
|
"seed": seed,
|
||||||
|
})
|
||||||
|
|
||||||
with httpx.Client(trust_env=False) as client:
|
with httpx.Client(trust_env=False) as client:
|
||||||
for rnd in range(n_rounds):
|
for rnd in range(n_rounds):
|
||||||
hour = rng.randint(6, 22)
|
hour = rng.randint(6, 22)
|
||||||
@@ -139,8 +182,6 @@ def run_simulation(
|
|||||||
for user_id, persona in user_personas:
|
for user_id, persona in user_personas:
|
||||||
seed_tasks = rnd * 997 + abs(hash(user_id)) % 997
|
seed_tasks = rnd * 997 + abs(hash(user_id)) % 997
|
||||||
tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks)
|
tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks)
|
||||||
|
|
||||||
# Per-persona profile features for v2 (synthetic for sim — see ADR-0012)
|
|
||||||
profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None
|
profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None
|
||||||
|
|
||||||
for policy in policies:
|
for policy in policies:
|
||||||
@@ -179,13 +220,34 @@ def run_simulation(
|
|||||||
prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0
|
prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0
|
||||||
acc[p]["cumulative_rewards"].append(prev + round_rewards[p])
|
acc[p]["cumulative_rewards"].append(prev + round_rewards[p])
|
||||||
|
|
||||||
|
if mlflow_ctx:
|
||||||
|
for p in policies:
|
||||||
|
mlflow.log_metric(f"{p}_cumulative_reward",
|
||||||
|
acc[p]["cumulative_rewards"][-1], step=rnd)
|
||||||
|
|
||||||
mode = "llm" if use_llm else "rule"
|
mode = "llm" if use_llm else "rule"
|
||||||
print(f" Round {rnd+1:>3}/{n_rounds} [{mode}] " + " ".join(
|
print(f" Round {rnd+1:>3}/{n_rounds} [{mode}] " + " ".join(
|
||||||
f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies
|
f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies
|
||||||
))
|
))
|
||||||
|
|
||||||
return _build_result(run_id, started_at, policies, acc, events,
|
result = _build_result(run_id, started_at, policies, acc, events,
|
||||||
n_users, n_rounds, tasks_per_round, use_llm, seed)
|
n_users, n_rounds, tasks_per_round, use_llm, seed)
|
||||||
|
result["mlflow_run_id"] = mlflow_run_id
|
||||||
|
|
||||||
|
if mlflow_ctx:
|
||||||
|
for p, s in result["summary"].items():
|
||||||
|
mlflow.log_metrics({
|
||||||
|
f"{p}_total_reward": s["total_reward"],
|
||||||
|
f"{p}_mean_reward": s["mean_reward"],
|
||||||
|
f"{p}_n_pulls": s["n_pulls"],
|
||||||
|
})
|
||||||
|
mlflow.set_tag("winner", result["winner"])
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
finally:
|
||||||
|
if mlflow_ctx:
|
||||||
|
mlflow_ctx.__exit__(None, None, None)
|
||||||
|
|
||||||
|
|
||||||
# ── Claude Code judge — phase 1: score ─────────────────────────────────────
|
# ── Claude Code judge — phase 1: score ─────────────────────────────────────
|
||||||
@@ -494,6 +556,9 @@ if __name__ == "__main__":
|
|||||||
help="Alias for --judge rule (backwards compat)")
|
help="Alias for --judge rule (backwards compat)")
|
||||||
parser.add_argument("--seed", type=int, default=42)
|
parser.add_argument("--seed", type=int, default=42)
|
||||||
parser.add_argument("--out", default=None)
|
parser.add_argument("--out", default=None)
|
||||||
|
parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI"),
|
||||||
|
help="MLflow tracking URI (e.g. http://mlflow:5000/mlflow)")
|
||||||
|
parser.add_argument("--mlflow-experiment", default="bandit_simulation")
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
if args.no_llm:
|
if args.no_llm:
|
||||||
@@ -534,6 +599,7 @@ if __name__ == "__main__":
|
|||||||
n_users=args.n_users, n_rounds=args.n_rounds,
|
n_users=args.n_users, n_rounds=args.n_rounds,
|
||||||
tasks_per_round=args.tasks_per_round, ml_url=args.ml_url,
|
tasks_per_round=args.tasks_per_round, ml_url=args.ml_url,
|
||||||
policies=args.policies, use_llm=use_llm, seed=args.seed,
|
policies=args.policies, use_llm=use_llm, seed=args.seed,
|
||||||
|
mlflow_url=args.mlflow_url, mlflow_experiment=args.mlflow_experiment,
|
||||||
)
|
)
|
||||||
Path(out_path).write_text(json.dumps(result, indent=2))
|
Path(out_path).write_text(json.dumps(result, indent=2))
|
||||||
print()
|
print()
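
Review note on the MLflow wiring in `run_simulation`: the manual `__enter__` / `__exit__(None, None, None)` pair could be replaced by a small context manager. A minimal sketch, assuming the same `_MLFLOW_AVAILABLE` guard as the diff; `optional_mlflow_run` is a hypothetical helper name. It uses only `mlflow.set_tracking_uri` and `mlflow.start_run`, which the diff already calls.

```python
from __future__ import annotations

import contextlib
from typing import Iterator

try:
    import mlflow
    _MLFLOW_AVAILABLE = True
except ImportError:
    mlflow = None
    _MLFLOW_AVAILABLE = False


@contextlib.contextmanager
def optional_mlflow_run(mlflow_url: str | None, run_name: str) -> Iterator[str | None]:
    """Yield the MLflow run_id, or None when tracking is disabled.

    Hypothetical helper: when MLflow is missing or no URL is given, the
    body of the `with` block still runs, just without tracking.
    """
    if not (_MLFLOW_AVAILABLE and mlflow_url):
        yield None
        return
    mlflow.set_tracking_uri(mlflow_url)
    # Using start_run() as a context manager lets MLflow see any exception
    # raised in the body instead of always being told the run succeeded.
    with mlflow.start_run(run_name=run_name) as active:
        yield active.info.run_id


# With tracking disabled the simulation body runs unchanged.
with optional_mlflow_run(None, run_name="demo") as mlflow_run_id:
    print(mlflow_run_id)
```

As far as I can tell, MLflow's run context manager records a failed status when the block raises, whereas calling `__exit__(None, None, None)` from `finally` reports every run as finished, including ones that crashed partway through.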
|
||||||
|
|||||||
@@ -1,3 +1,8 @@
|
|||||||
from .context import build_context, PromptContext, TaskSignal
|
from .context import build_context, PromptContext, TaskSignal, ContextFeatureSpec, CONTEXT_FEATURES
|
||||||
|
from .profile_schema import ProfileFeature, PROFILE_FEATURES, feature_names
|
||||||
|
|
||||||
__all__ = ["build_context", "PromptContext", "TaskSignal"]
|
__all__ = [
|
||||||
|
"build_context", "PromptContext", "TaskSignal",
|
||||||
|
"ContextFeatureSpec", "CONTEXT_FEATURES",
|
||||||
|
"ProfileFeature", "PROFILE_FEATURES", "feature_names",
|
||||||
|
]
|
||||||
|
|||||||
@@ -2,12 +2,56 @@
|
|||||||
Context assembler — converts raw user signals into a PromptContext for LLM tip generation.
|
Context assembler — converts raw user signals into a PromptContext for LLM tip generation.
|
||||||
|
|
||||||
Usage:
|
Usage:
|
||||||
from ml.features.context import build_context
|
from ml.features.context import build_context, CONTEXT_FEATURES
|
||||||
ctx = build_context(tasks, hour_of_day=9, day_of_week=2)
|
ctx = build_context(tasks, hour_of_day=9, day_of_week=2)
|
||||||
|
|
||||||
|
Feature-spec (issue #61):
|
||||||
|
All context features are JIT — they are assembled at request time from live
|
||||||
|
sources (system clock, caller-supplied task list) rather than read from a
|
||||||
|
cached profile store. They carry no TTL because they are never persisted.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
from dataclasses import dataclass, field
|
from dataclasses import dataclass, field
|
||||||
|
from typing import Literal
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class ContextFeatureSpec:
|
||||||
|
name: str
|
||||||
|
dtype: Literal["numeric", "categorical", "list"]
|
||||||
|
freshness: Literal["jit", "batched"]
|
||||||
|
source: str
|
||||||
|
fallback: str
|
||||||
|
description: str
|
||||||
|
|
||||||
|
|
||||||
|
CONTEXT_FEATURES: tuple[ContextFeatureSpec, ...] = (
|
||||||
|
ContextFeatureSpec(
|
||||||
|
name="hour_of_day",
|
||||||
|
dtype="numeric",
|
||||||
|
freshness="jit",
|
||||||
|
source="request",
|
||||||
|
fallback="12",
|
||||||
|
description="Current hour (0–23), supplied by the caller at score time.",
|
||||||
|
),
|
||||||
|
ContextFeatureSpec(
|
||||||
|
name="day_of_week",
|
||||||
|
dtype="numeric",
|
||||||
|
freshness="jit",
|
||||||
|
source="request",
|
||||||
|
fallback="0",
|
||||||
|
description="Weekday index (0=Monday … 6=Sunday), supplied by the caller at score time.",
|
||||||
|
),
|
||||||
|
ContextFeatureSpec(
|
||||||
|
name="tasks",
|
||||||
|
dtype="list",
|
||||||
|
freshness="jit",
|
||||||
|
source="todoist-integration",
|
||||||
|
fallback="[]",
|
||||||
|
description="User's open tasks fetched live from the Todoist integration at request time.",
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
|
|||||||
@@ -8,6 +8,12 @@ code (ml/serving, eval harnesses, notebooks) knows what fields to expect on
|
|||||||
|
|
||||||
Update this file whenever you add or rename a feature in the TS registry.
|
Update this file whenever you add or rename a feature in the TS registry.
|
||||||
The accompanying test asserts the two stay in sync at the name level.
|
The accompanying test asserts the two stay in sync at the name level.
|
||||||
|
|
||||||
|
Feature-spec fields (issue #61):
|
||||||
|
freshness — "batched": value cached in profile store, recomputed on TTL/event.
|
||||||
|
ttl_sec — cache lifetime in seconds; mirrors ``ttlSec`` in registry.ts.
|
||||||
|
source — where the value originates.
|
||||||
|
fallback — raw value returned when the feature is unavailable (null stored).
|
||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
@@ -16,6 +22,10 @@ from typing import Literal
|
|||||||
|
|
||||||
|
|
||||||
Dtype = Literal["numeric", "categorical"]
|
Dtype = Literal["numeric", "categorical"]
|
||||||
|
Freshness = Literal["jit", "batched"]
|
||||||
|
|
||||||
|
_HOUR = 3600
|
||||||
|
_DAY = 86_400
|
||||||
|
|
||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
@@ -23,28 +33,57 @@ class ProfileFeature:
|
|||||||
name: str
|
name: str
|
||||||
dtype: Dtype
|
dtype: Dtype
|
||||||
description: str
|
description: str
|
||||||
|
freshness: Freshness
|
||||||
|
ttl_sec: int
|
||||||
|
source: str
|
||||||
|
fallback: str
|
||||||
|
|
||||||
|
|
||||||
PROFILE_FEATURES: tuple[ProfileFeature, ...] = (
|
PROFILE_FEATURES: tuple[ProfileFeature, ...] = (
|
||||||
ProfileFeature(
|
ProfileFeature(
|
||||||
"completion_rate_30d", "numeric",
|
name="completion_rate_30d",
|
||||||
'Fraction of tips served in the last 30 days that received a "done" reaction.',
|
dtype="numeric",
|
||||||
|
description='Fraction of tips served in the last 30 days that received a "done" reaction.',
|
||||||
|
freshness="batched",
|
||||||
|
ttl_sec=6 * _HOUR,
|
||||||
|
source="profile_store",
|
||||||
|
fallback="0.0",
|
||||||
),
|
),
|
||||||
ProfileFeature(
|
ProfileFeature(
|
||||||
"dismiss_rate_30d", "numeric",
|
name="dismiss_rate_30d",
|
||||||
'Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
|
dtype="numeric",
|
||||||
|
description='Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
|
||||||
|
freshness="batched",
|
||||||
|
ttl_sec=6 * _HOUR,
|
||||||
|
source="profile_store",
|
||||||
|
fallback="0.0",
|
||||||
),
|
),
|
||||||
ProfileFeature(
|
ProfileFeature(
|
||||||
"mean_dwell_ms_30d", "numeric",
|
name="mean_dwell_ms_30d",
|
||||||
"Average dwell time (ms between served and reacted) over the last 30 days.",
|
dtype="numeric",
|
||||||
|
description="Average dwell time (ms between served and reacted) over the last 30 days.",
|
||||||
|
freshness="batched",
|
||||||
|
ttl_sec=6 * _HOUR,
|
||||||
|
source="profile_store",
|
||||||
|
fallback="null — serving normalises to 0.0",
|
||||||
),
|
),
|
||||||
ProfileFeature(
|
ProfileFeature(
|
||||||
"preferred_hour", "numeric",
|
name="preferred_hour",
|
||||||
'Hour-of-day with the most "done" reactions in the last 30 days (0-23).',
|
dtype="numeric",
|
||||||
|
description='Hour-of-day with the most "done" reactions in the last 30 days (0–23).',
|
||||||
|
freshness="batched",
|
||||||
|
ttl_sec=_DAY,
|
||||||
|
source="profile_store",
|
||||||
|
fallback="null — serving normalises to 0.5 (neutral alignment)",
|
||||||
),
|
),
|
||||||
ProfileFeature(
|
ProfileFeature(
|
||||||
"tip_volume_30d", "numeric",
|
name="tip_volume_30d",
|
||||||
"Number of tips served to the user in the last 30 days.",
|
dtype="numeric",
|
||||||
|
description="Number of tips served to the user in the last 30 days.",
|
||||||
|
freshness="batched",
|
||||||
|
ttl_sec=_HOUR,
|
||||||
|
source="profile_store",
|
||||||
|
fallback="0",
|
||||||
),
|
),
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -1,7 +1,7 @@
|
|||||||
"""Tests for ml/features/context.py"""
|
"""Tests for ml/features/context.py"""
|
||||||
import pytest
|
import pytest
|
||||||
import sys, os; sys.path.insert(0, os.path.dirname(__file__))
|
import sys, os; sys.path.insert(0, os.path.dirname(__file__))
|
||||||
from context import build_context, TaskSignal, PromptContext
|
from context import build_context, TaskSignal, PromptContext, CONTEXT_FEATURES
|
||||||
|
|
||||||
|
|
||||||
def test_empty_tasks():
|
def test_empty_tasks():
|
||||||
@@ -62,3 +62,30 @@ def test_due_date_none_preserved():
|
|||||||
tasks = [TaskSignal(id="x", content="No due", due_date=None)]
|
tasks = [TaskSignal(id="x", content="No due", due_date=None)]
|
||||||
ctx = build_context(tasks)
|
ctx = build_context(tasks)
|
||||||
assert ctx.tasks[0]["due_date"] is None
|
assert ctx.tasks[0]["due_date"] is None
|
||||||
|
|
||||||
|
|
||||||
|
# ── CONTEXT_FEATURES spec tests (issue #61) ──────────────────────────────────
|
||||||
|
|
||||||
|
def test_context_features_expected_names():
|
||||||
|
names = {f.name for f in CONTEXT_FEATURES}
|
||||||
|
assert names == {"hour_of_day", "day_of_week", "tasks"}
|
||||||
|
|
||||||
|
|
||||||
|
def test_context_features_all_jit():
|
||||||
|
for f in CONTEXT_FEATURES:
|
||||||
|
assert f.freshness == "jit", f"{f.name}: expected freshness='jit', got {f.freshness!r}"
|
||||||
|
|
||||||
|
|
||||||
|
def test_context_features_source_set():
|
||||||
|
for f in CONTEXT_FEATURES:
|
||||||
|
assert f.source, f"{f.name}: source must not be empty"
|
||||||
|
|
||||||
|
|
||||||
|
def test_context_features_fallback_set():
|
||||||
|
for f in CONTEXT_FEATURES:
|
||||||
|
assert f.fallback, f"{f.name}: fallback must not be empty"
|
||||||
|
|
||||||
|
|
||||||
|
def test_context_features_no_duplicates():
|
||||||
|
names = [f.name for f in CONTEXT_FEATURES]
|
||||||
|
assert len(names) == len(set(names)), f"duplicate names: {names}"
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""Smoke test for profile_schema mirror (#81 phase A).
|
"""Smoke test for profile_schema mirror (#81 phase A, #61 freshness spec).
|
||||||
|
|
||||||
The TS registry in services/api/src/profile/registry.ts is the source of truth.
|
The TS registry in services/api/src/profile/registry.ts is the source of truth.
|
||||||
This test checks the names listed here match the registry by reading the TS
|
This test checks the names listed here match the registry by reading the TS
|
||||||
@@ -14,6 +14,18 @@ from ml.features.profile_schema import PROFILE_FEATURES, feature_names
|
|||||||
|
|
||||||
REGISTRY_PATH = Path(__file__).resolve().parents[2] / "services" / "api" / "src" / "profile" / "registry.ts"
|
REGISTRY_PATH = Path(__file__).resolve().parents[2] / "services" / "api" / "src" / "profile" / "registry.ts"
|
||||||
|
|
||||||
|
_HOUR = 3600
|
||||||
|
_DAY = 86_400
|
||||||
|
|
||||||
|
# Expected ttl_sec values mirrored from registry.ts — keeps the two in sync.
|
||||||
|
_EXPECTED_TTL: dict[str, int] = {
|
||||||
|
"completion_rate_30d": 6 * _HOUR,
|
||||||
|
"dismiss_rate_30d": 6 * _HOUR,
|
||||||
|
"mean_dwell_ms_30d": 6 * _HOUR,
|
||||||
|
"preferred_hour": _DAY,
|
||||||
|
"tip_volume_30d": _HOUR,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
def _ts_registry_names() -> set[str]:
|
def _ts_registry_names() -> set[str]:
|
||||||
text = REGISTRY_PATH.read_text(encoding="utf-8")
|
text = REGISTRY_PATH.read_text(encoding="utf-8")
|
||||||
@@ -21,6 +33,35 @@ def _ts_registry_names() -> set[str]:
|
|||||||
return set(re.findall(r"name:\s*'([a-zA-Z0-9_]+)'", text))
|
return set(re.findall(r"name:\s*'([a-zA-Z0-9_]+)'", text))
|
||||||
|
|
||||||
|
|
||||||
|
def _ts_registry_ttls() -> dict[str, int]:
|
||||||
|
"""Parse ttlSec values from registry.ts (crude but sufficient for drift detection).
|
||||||
|
|
||||||
|
Handles TS symbolic constants (HOUR, DAY) and expressions like ``6 * HOUR``.
|
||||||
|
"""
|
||||||
|
text = REGISTRY_PATH.read_text(encoding="utf-8")
|
||||||
|
|
||||||
|
# Extract numeric constants: `const HOUR = 3600;` or `const DAY = 86_400;`
|
||||||
|
consts: dict[str, int] = {}
|
||||||
|
for m in re.finditer(r"const\s+([A-Z_]+)\s*=\s*([\d_]+)", text):
|
||||||
|
consts[m.group(1)] = int(m.group(2).replace("_", ""))
|
||||||
|
|
||||||
|
def _eval_expr(expr: str) -> int:
|
||||||
|
tokens = [t.strip() for t in expr.split("*")]
|
||||||
|
result = 1
|
||||||
|
for t in tokens:
|
||||||
|
result *= consts[t] if t in consts else int(t)
|
||||||
|
return result
|
||||||
|
|
||||||
|
result: dict[str, int] = {}
|
||||||
|
for block in re.split(r"\{", text):
|
||||||
|
name_m = re.search(r"name:\s*'([a-zA-Z0-9_]+)'", block)
|
||||||
|
# ttlSec may be a constant name, a number, or `N * CONST`
|
||||||
|
ttl_m = re.search(r"ttlSec:\s*([A-Za-z0-9_]+(?:\s*\*\s*[A-Za-z0-9_]+)?)", block)
|
||||||
|
if name_m and ttl_m:
|
||||||
|
result[name_m.group(1)] = _eval_expr(ttl_m.group(1))
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
def test_python_mirror_matches_ts_registry():
|
||||||
py_names = feature_names()
|
||||||
ts_names = _ts_registry_names()
|
||||||
@@ -39,3 +80,34 @@ def test_profile_schema_no_duplicates():
|
|||||||
def test_profile_schema_dtypes_known():
|
||||||
for f in PROFILE_FEATURES:
|
||||||
assert f.dtype in {"numeric", "categorical"}
|
||||||
|
|
||||||
|
|
||||||
|
def test_all_profile_features_are_batched():
|
||||||
|
for f in PROFILE_FEATURES:
|
||||||
|
assert f.freshness == "batched", f"{f.name}: expected freshness='batched', got {f.freshness!r}"
|
||||||
|
|
||||||
|
|
||||||
|
def test_profile_feature_ttl_matches_ts_registry():
|
||||||
|
ts_ttls = _ts_registry_ttls()
|
||||||
|
for f in PROFILE_FEATURES:
|
||||||
|
assert f.name in ts_ttls, f"{f.name} not found in TS registry ttlSec parse"
|
||||||
|
assert f.ttl_sec == ts_ttls[f.name], (
|
||||||
|
f"{f.name}: Python ttl_sec={f.ttl_sec} != TS ttlSec={ts_ttls[f.name]}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_profile_feature_ttl_matches_expected():
|
||||||
|
for f in PROFILE_FEATURES:
|
||||||
|
assert f.ttl_sec == _EXPECTED_TTL[f.name], (
|
||||||
|
f"{f.name}: ttl_sec={f.ttl_sec}, expected {_EXPECTED_TTL[f.name]}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_profile_feature_source_is_profile_store():
|
||||||
|
for f in PROFILE_FEATURES:
|
||||||
|
assert f.source == "profile_store", f"{f.name}: unexpected source {f.source!r}"
|
||||||
|
|
||||||
|
|
||||||
|
def test_profile_feature_fallback_set():
|
||||||
|
for f in PROFILE_FEATURES:
|
||||||
|
assert f.fallback, f"{f.name}: fallback must not be empty"
|
||||||
|
|||||||
124
ml/pipelines/sim_dag.py
Normal file
@@ -0,0 +1,124 @@
|
|||||||
|
"""
|
||||||
|
Airflow DAG: bandit_sim
|
||||||
|
|
||||||
|
Runs a bandit policy simulation and logs results to MLflow.
|
||||||
|
Triggered on-demand from the oO admin panel or manually from the Airflow UI.
|
||||||
|
|
||||||
|
Required conf keys (passed via dag_run.conf):
|
||||||
|
sim_run_id str — oO SQLite run ID for callback correlation
|
||||||
|
n_users int — number of synthetic users
|
||||||
|
n_rounds int — rounds per user
|
||||||
|
tasks_per_round int — candidate pool size per round
|
||||||
|
policies list — policy names to compare
|
||||||
|
judge_mode str — "rule" | "llm"
|
||||||
|
ml_url str — ml/serving URL (e.g. http://ml-serving:8000)
|
||||||
|
mlflow_url str — MLflow tracking URI (e.g. http://mlflow:5000/mlflow)
|
||||||
|
callback_url str — oO API callback endpoint
|
||||||
|
internal_token str — x-internal-token header value
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
|
||||||
|
from airflow import DAG
|
||||||
|
from airflow.operators.python import PythonOperator
|
||||||
|
|
||||||
|
|
||||||
|
def _run_sim(**context: object) -> dict:
|
||||||
|
conf: dict = context["dag_run"].conf or {}
|
||||||
|
|
||||||
|
n_users = int(conf.get("n_users", 5))
|
||||||
|
n_rounds = int(conf.get("n_rounds", 20))
|
||||||
|
tasks_per_round = int(conf.get("tasks_per_round", 8))
|
||||||
|
policies = list(conf.get("policies", ["linucb-v1", "egreedy-v1"]))
|
||||||
|
judge_mode = str(conf.get("judge_mode", "rule"))
|
||||||
|
ml_url = str(conf.get("ml_url", "http://ml-serving:8000"))
|
||||||
|
mlflow_url = str(conf.get("mlflow_url", os.environ.get("MLFLOW_TRACKING_URI", "")))
|
||||||
|
mlflow_experiment = "bandit_simulation"
|
||||||
|
|
||||||
|
sys.path.insert(0, "/opt/airflow/ml/experiments/sim")
|
||||||
|
from runner import run_simulation # type: ignore[import]
|
||||||
|
|
||||||
|
use_llm = judge_mode == "llm"
|
||||||
|
result = run_simulation(
|
||||||
|
n_users=n_users,
|
||||||
|
n_rounds=n_rounds,
|
||||||
|
tasks_per_round=tasks_per_round,
|
||||||
|
ml_url=ml_url,
|
||||||
|
policies=policies,
|
||||||
|
use_llm=use_llm,
|
||||||
|
seed=42,
|
||||||
|
mlflow_url=mlflow_url or None,
|
||||||
|
mlflow_experiment=mlflow_experiment,
|
||||||
|
)
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def _callback(**context: object) -> None:
|
||||||
|
import httpx
|
||||||
|
|
||||||
|
conf: dict = context["dag_run"].conf or {}
|
||||||
|
callback_url: str = str(conf.get("callback_url", ""))
|
||||||
|
internal_token: str = str(conf.get("internal_token", ""))
|
||||||
|
|
||||||
|
if not callback_url or not internal_token:
|
||||||
|
print("No callback_url or internal_token — skipping result push.", flush=True)
|
||||||
|
return
|
||||||
|
|
||||||
|
result: dict = context["ti"].xcom_pull(task_ids="run_sim")
|
||||||
|
if not result:
|
||||||
|
print("No result from run_sim task — callback skipped.", flush=True)
|
||||||
|
return
|
||||||
|
|
||||||
|
payload = {
|
||||||
|
"summary": result.get("summary", {}),
|
||||||
|
"winner": result.get("winner", ""),
|
||||||
|
"persona_breakdown": result.get("persona_breakdown", {}),
|
||||||
|
"events": result.get("events", []),
|
||||||
|
"mlflow_run_id": result.get("mlflow_run_id"),
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
r = httpx.post(
|
||||||
|
callback_url,
|
||||||
|
json=payload,
|
||||||
|
headers={"x-internal-token": internal_token},
|
||||||
|
timeout=30.0,
|
||||||
|
)
|
||||||
|
r.raise_for_status()
|
||||||
|
print(f"Callback OK: {r.status_code}", flush=True)
|
||||||
|
except Exception as exc:
|
||||||
|
print(f"Callback failed: {exc}", flush=True)
|
||||||
|
raise
|
||||||
|
|
||||||
|
|
||||||
|
with DAG(
|
||||||
|
dag_id="bandit_sim",
|
||||||
|
description="On-demand bandit policy simulation with MLflow tracking",
|
||||||
|
schedule_interval=None,
|
||||||
|
start_date=datetime(2025, 1, 1),
|
||||||
|
catchup=False,
|
||||||
|
tags=["bandit", "simulation", "ml"],
|
||||||
|
default_args={
|
||||||
|
"retries": 1,
|
||||||
|
"retry_delay": timedelta(minutes=2),
|
||||||
|
},
|
||||||
|
) as dag:
|
||||||
|
|
||||||
|
run_sim = PythonOperator(
|
||||||
|
task_id="run_sim",
|
||||||
|
python_callable=_run_sim,
|
||||||
|
provide_context=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
push_results = PythonOperator(
|
||||||
|
task_id="push_results",
|
||||||
|
python_callable=_callback,
|
||||||
|
provide_context=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
run_sim >> push_results
|
||||||
104
ml/serving/README.md
Normal file
@@ -0,0 +1,104 @@
|
|||||||
|
# ml/serving
|
||||||
|
|
||||||
|
FastAPI online scorer, tip generator, and JetStream consumer.
|
||||||
|
|
||||||
|
## Contract
|
||||||
|
|
||||||
|
| Endpoint | Description |
|
||||||
|
|----------|-------------|
|
||||||
|
| `POST /score` | LinUCB d=5 (baseline, shadow-eligible) |
|
||||||
|
| `POST /score/egreedy` | ε-greedy v1, d=7 (active policy — ADR-0007) |
|
||||||
|
| `POST /score/egreedy/v2` | ε-greedy v2, d=12 + profile features (shadow — ADR-0012) |
|
||||||
|
| `POST /reward` / `/reward/egreedy` / `/reward/egreedy/v2` | Online reward update per policy |
|
||||||
|
| `POST /generate` | LLM tip candidates via LiteLLM `tip-generator` alias |
|
||||||
|
| `GET /stats/{user_id}` / `/stats/egreedy/{user_id}` / `/stats/egreedy/v2/{user_id}` | Per-user policy stats |
|
||||||
|
| `GET /features/{user_id}` | Last 100 scored feature vectors (ring buffer) |
|
||||||
|
| `POST /reset/{user_id}` | Clear all per-user bandit state (admin) |
|
||||||
|
| `GET /health` | `{ ok, nats: { enabled, consumers: { signals, feedback } } }` |
|
||||||
|
|
||||||
|
Called by `services/api/src/recommender/` over HTTP. Contract is stable across policy swaps.
|
||||||
|
|
||||||
|
## Feature dimensions
|
||||||
|
|
||||||
|
| Policy | d | Extra dims vs previous |
|
||||||
|
|--------|---|------------------------|
|
||||||
|
| LinUCB v1 | 5 | hour_sin/cos, is_overdue, task_age, priority |
|
||||||
|
| ε-greedy v1 | 7 | + dow_sin/cos |
|
||||||
|
| ε-greedy v2 | 12 | + 5 profile features (ADR-0012) |
|
||||||
|
|
||||||
|
Profile features are computed by the TypeScript API and shipped on each `/score` call as `profile_features`. See `ml/README.md` and ADR-0011.
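
As a rough illustration of how the dimensions in the table stack up, the sketch below builds the 5-, 7- and 12-dimensional vectors. The function names, the ordering of components, the absence of any scaling, and the `0.0` default for missing profile features are assumptions made for illustration; this is not the encoder used by `ml/serving`. The five profile feature names are the ones mirrored in the registry tests.

```python
import math
from datetime import datetime

# Profile feature names as mirrored in the registry tests; their order here is an assumption.
PROFILE_KEYS = (
    "completion_rate_30d",
    "dismiss_rate_30d",
    "mean_dwell_ms_30d",
    "preferred_hour",
    "tip_volume_30d",
)


def cyclic(value: float, period: float) -> tuple[float, float]:
    """Encode a periodic value as (sin, cos) so the wrap-around point stays continuous."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def linucb_features(now: datetime, is_overdue: bool, task_age: float, priority: float) -> list[float]:
    """d=5: hour_sin, hour_cos, is_overdue, task_age, priority."""
    hour_sin, hour_cos = cyclic(now.hour, 24)
    return [hour_sin, hour_cos, float(is_overdue), task_age, priority]


def egreedy_v1_features(now: datetime, is_overdue: bool, task_age: float, priority: float) -> list[float]:
    """d=7: the LinUCB vector plus day-of-week sin/cos."""
    dow_sin, dow_cos = cyclic(now.weekday(), 7)
    return linucb_features(now, is_overdue, task_age, priority) + [dow_sin, dow_cos]


def egreedy_v2_features(
    now: datetime,
    is_overdue: bool,
    task_age: float,
    priority: float,
    profile_features: dict[str, float],
) -> list[float]:
    """d=12: the v1 vector plus five profile features (missing keys default to 0.0 here)."""
    base = egreedy_v1_features(now, is_overdue, task_age, priority)
    return base + [float(profile_features.get(k, 0.0)) for k in PROFILE_KEYS]
```

Each policy extends the previous vector rather than redefining it, which is why the table lists only the extra dimensions per row.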
|
||||||
|
|
||||||
|
## JetStream consumers
|
||||||
|
|
||||||
|
On startup, `nats_consumer.py` registers two durable push consumers against NATS JetStream:
|
||||||
|
|
||||||
|
| Consumer | Stream | Subjects | Durable name |
|
||||||
|
|----------|--------|----------|--------------|
|
||||||
|
| signals | `signals` | `signals.>` | `feature-pipeline-signals` |
|
||||||
|
| feedback | `feedback` | `feedback.>` | `feature-pipeline-feedback` |
|
||||||
|
|
||||||
|
**Handled subjects:**
|
||||||
|
- `signals.task.synced` — writes `{last_sync_ts, task_count}` to `{STATE_DIR}/{user}_sync.json`
|
||||||
|
- `signals.tip.feedback` — logged for observability; reward update happens via the HTTP path in the recommender
|
||||||
|
|
||||||
|
**Payload validation:** each message is validated against the pydantic models in `schemas.py` (mirroring `packages/shared-types/events/oo/events/v1/`). A `ValidationError` triggers a nak so the message is redelivered rather than silently dropped.
|
||||||
|
|
||||||
|
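**Ack semantics:** explicit ack on success; nak for redelivery on error. Once a message has been delivered `NATS_MAX_DELIVER` times, JetStream stops redelivering it, which matches the "dropping" wording in the Config table.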
|
||||||
|
|
||||||
|
**Disabled** when `NATS_URL` is unset (default in local dev without NATS). No import of `nats-py` occurs in that case.
|
||||||
|
|
||||||
|
## Observability
|
||||||
|
|
||||||
|
Logs are structured JSON via **structlog**. Every line includes `level`, `logger`, `timestamp`, and — when a W3C `traceparent` header is present on the incoming request — `trace_id` bound via Python `contextvars`, so all log lines within a request carry the same trace ID as the upstream API call.
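
A minimal sketch of the trace ID extraction described above, assuming the same length-only check that the tracing middleware in this change applies (four dash-separated fields, 32-character trace ID). The function name `extract_trace_id` is hypothetical.

```python
from typing import Optional


def extract_trace_id(traceparent: str) -> Optional[str]:
    """Return the trace-id field of a W3C traceparent header, or None if it looks malformed."""
    # Header layout: version-traceid-parentid-flags
    parts = traceparent.split("-")
    if len(parts) != 4:
        return None
    trace_id = parts[1]
    if len(trace_id) != 32:
        return None
    return trace_id
```

The check is deliberately loose: it does not verify that the field is hexadecimal, so a stricter validator may be preferable if the ID is used for anything beyond log correlation.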
|
||||||
|
|
||||||
|
Sentry error capture is active when `SENTRY_DSN` is set.
|
||||||
|
|
||||||
|
## Config
|
||||||
|
|
||||||
|
| Env var | Default | Description |
|
||||||
|
|---------|---------|-------------|
|
||||||
|
| `STATE_DIR` | `/tmp/oo-bandit-state` | Directory for per-user bandit state JSON files |
|
||||||
|
| `LITELLM_URL` | `http://localhost:4000` | LiteLLM gateway |
|
||||||
|
| `LITELLM_MASTER_KEY` | `sk-oo-dev` | LiteLLM auth key |
|
||||||
|
| `NATS_URL` | `` | NATS broker URL; empty = consumers disabled |
|
||||||
|
| `NATS_DURABLE_PREFIX` | `feature-pipeline` | Prefix for durable consumer names |
|
||||||
|
| `NATS_MAX_DELIVER` | `5` | Max delivery attempts (JetStream `max_deliver`) before the message is dropped |
|
||||||
|
| `DEFAULT_PROMPT_VERSION` | `v1` | Fallback prompt version for `/generate` |
|
||||||
|
| `ENV` | `development` | Environment label (passed to Sentry) |
|
||||||
|
| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
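
The table above can be read as a small typed settings object. The sketch below is one way to load it, assuming the defaults listed in the table; `ServingConfig` and `load_config` are hypothetical names and are not part of `ml/serving`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServingConfig:
    state_dir: str
    litellm_url: str
    litellm_master_key: str
    nats_url: str
    nats_durable_prefix: str
    nats_max_deliver: int
    default_prompt_version: str
    env: str
    sentry_dsn: str

    @property
    def nats_enabled(self) -> bool:
        # An empty NATS_URL means the JetStream consumers stay disabled.
        return bool(self.nats_url)


def load_config(environ: Mapping[str, str] = os.environ) -> ServingConfig:
    """Read the documented env vars, falling back to the documented defaults."""
    return ServingConfig(
        state_dir=environ.get("STATE_DIR", "/tmp/oo-bandit-state"),
        litellm_url=environ.get("LITELLM_URL", "http://localhost:4000"),
        litellm_master_key=environ.get("LITELLM_MASTER_KEY", "sk-oo-dev"),
        nats_url=environ.get("NATS_URL", ""),
        nats_durable_prefix=environ.get("NATS_DURABLE_PREFIX", "feature-pipeline"),
        nats_max_deliver=int(environ.get("NATS_MAX_DELIVER", "5")),
        default_prompt_version=environ.get("DEFAULT_PROMPT_VERSION", "v1"),
        env=environ.get("ENV", "development"),
        sentry_dsn=environ.get("SENTRY_DSN", ""),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests, for example `load_config({"NATS_URL": "nats://localhost:4222"})`.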
|
||||||
|
|
||||||
|
## Health story
|
||||||
|
|
||||||
|
`GET /health` returns `{ ok: true }` plus NATS consumer state:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"ok": true,
|
||||||
|
"nats": {
|
||||||
|
"enabled": true,
|
||||||
|
"consumers": {
|
||||||
|
"signals": { "last_msg_ts": "2026-04-25T10:00:00Z", "processed": 42, "errors": 0 },
|
||||||
|
"feedback": { "last_msg_ts": null, "processed": 0, "errors": 0 }
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`last_msg_ts` is `null` until the first message arrives. The docker-compose healthcheck polls this endpoint.
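
For monitoring beyond the container healthcheck, a caller might interpret the payload roughly as follows. This is a sketch under assumptions: `health_problems` and the `max_errors` threshold are hypothetical, and fetching the JSON over HTTP is left out.

```python
def health_problems(health: dict, max_errors: int = 0) -> list[str]:
    """Return human-readable problems found in a /health payload (empty list = healthy)."""
    problems: list[str] = []
    if not health.get("ok"):
        problems.append("service reports ok=false")
    nats = health.get("nats", {})
    if not nats.get("enabled"):
        # Consumers are disabled (NATS_URL unset), so there is nothing more to check.
        return problems
    for name, stats in nats.get("consumers", {}).items():
        errors = stats.get("errors", 0)
        if errors > max_errors:
            problems.append(f"consumer {name}: {errors} processing errors")
    return problems
```

A `null` `last_msg_ts` is not treated as a problem here, since a consumer that has not yet received a message is a normal state after startup.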
|
||||||
|
|
||||||
|
## Extraction criteria
|
||||||
|
|
||||||
|
Already runs as its own process. Move it to a dedicated host / GPU node when:
|
||||||
|
- p99 scoring latency exceeds 50 ms under load, **or**
|
||||||
|
- model weights are too large to share memory with the Python process on the current host.
|
||||||
|
|
||||||
|
## State
|
||||||
|
|
||||||
|
Per-user bandit state is stored as JSON files in `STATE_DIR`:
|
||||||
|
|
||||||
|
| File pattern | Policy |
|
||||||
|
|---|---|
|
||||||
|
| `{user}.json` | LinUCB v1 |
|
||||||
|
| `{user}_egreedy.json` | ε-greedy v1 |
|
||||||
|
| `{user}_egreedy_v2.json` | ε-greedy v2 |
|
||||||
|
| `{user}_sync.json` | Last task sync metadata (written by JetStream consumer) |
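
The file patterns above can be sketched as a single path helper. The user ID sanitisation copies what `_sync_meta_path` in `nats_consumer.py` does for the sync file; applying the same rule to the policy files, the `kind` keys, and the name `state_path` are assumptions for illustration.

```python
from pathlib import Path

# Suffix per state file kind, following the table above (keys are hypothetical labels).
_SUFFIXES = {
    "linucb-v1": "",
    "egreedy-v1": "_egreedy",
    "egreedy-v2": "_egreedy_v2",
    "sync": "_sync",
}


def state_path(state_dir: Path, user_id: str, kind: str) -> Path:
    """Build the JSON state file path for one user and one kind of state."""
    # Replace every non-alphanumeric character so the user ID is safe as a file name.
    safe = "".join(c if c.isalnum() else "_" for c in user_id)
    return state_dir / f"{safe}{_SUFFIXES[kind]}.json"
```

For example, `state_path(Path("/tmp/oo-bandit-state"), "user-abc", "sync")` points at `user_abc_sync.json` inside the state directory.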
|
||||||
20
ml/serving/logging_config.py
Normal file
@@ -0,0 +1,20 @@
|
|||||||
|
"""Structlog JSON configuration — import once at process start."""
|
||||||
|
import logging
|
||||||
|
import structlog
|
||||||
|
|
||||||
|
|
||||||
|
def configure() -> None:
|
||||||
|
structlog.configure(
|
||||||
|
processors=[
|
||||||
|
structlog.contextvars.merge_contextvars,
|
||||||
|
structlog.stdlib.add_log_level,
|
||||||
|
structlog.stdlib.add_logger_name,
|
||||||
|
structlog.processors.TimeStamper(fmt="iso"),
|
||||||
|
structlog.processors.StackInfoRenderer(),
|
||||||
|
structlog.processors.JSONRenderer(),
|
||||||
|
],
|
||||||
|
wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
|
||||||
|
context_class=dict,
|
||||||
|
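logger_factory=structlog.stdlib.LoggerFactory(),  # add_logger_name needs logger.name; PrintLogger has none
|
||||||
|
)
|
||||||
|
logging.basicConfig(level=logging.INFO, format="%(message)s")  # let INFO events through as raw JSON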
|
||||||
@@ -28,17 +28,55 @@ import math
|
|||||||
import os
|
||||||
import time
|
||||||
from collections import deque
|
||||||
|
from contextlib import asynccontextmanager
|
||||||
from pathlib import Path
|
||||||
from typing import Optional, Deque
|
||||||
|
|
||||||
import httpx
|
||||||
import numpy as np
|
||||||
from fastapi import FastAPI, HTTPException
|
import sentry_sdk
|
||||||
|
import structlog
|
||||||
|
import structlog.contextvars
|
||||||
|
from fastapi import FastAPI, HTTPException, Request
|
||||||
from pydantic import BaseModel
|
||||||
|
from starlette.middleware.base import BaseHTTPMiddleware
|
||||||
|
|
||||||
|
import logging_config
|
||||||
|
import nats_consumer
|
||||||
from prompts import get_prompt
|
||||||
|
|
||||||
app = FastAPI(title="oO ML Serving", version="1.0.0")
|
logging_config.configure()
|
||||||
|
|
||||||
|
_SENTRY_DSN = os.getenv("SENTRY_DSN")
|
||||||
|
if _SENTRY_DSN:
|
||||||
|
sentry_sdk.init(dsn=_SENTRY_DSN, environment=os.getenv("ENV", "development"))
|
||||||
|
|
||||||
|
log = structlog.get_logger()
|
||||||
|
|
||||||
|
|
||||||
|
@asynccontextmanager
|
||||||
|
async def lifespan(app: FastAPI):
|
||||||
|
await nats_consumer.start(STATE_DIR)
|
||||||
|
yield
|
||||||
|
await nats_consumer.stop()
|
||||||
|
|
||||||
|
|
||||||
|
app = FastAPI(title="oO ML Serving", version="1.0.0", lifespan=lifespan)
|
||||||
|
|
||||||
|
|
||||||
|
class _TracingMiddleware(BaseHTTPMiddleware):
|
||||||
|
async def dispatch(self, request: Request, call_next):
|
||||||
|
structlog.contextvars.clear_contextvars()
|
||||||
|
traceparent = request.headers.get("traceparent", "")
|
||||||
|
if traceparent:
|
||||||
|
parts = traceparent.split("-")
|
||||||
|
trace_id = parts[1] if len(parts) == 4 and len(parts[1]) == 32 else None
|
||||||
|
if trace_id:
|
||||||
|
structlog.contextvars.bind_contextvars(trace_id=trace_id)
|
||||||
|
return await call_next(request)
|
||||||
|
|
||||||
|
|
||||||
|
app.add_middleware(_TracingMiddleware)
|
||||||
|
|
||||||
LITELLM_URL = os.getenv("LITELLM_URL", "http://localhost:4000")
|
||||||
LITELLM_MASTER_KEY = os.getenv("LITELLM_MASTER_KEY", "sk-oo-dev")
|
||||||
@@ -315,7 +353,13 @@ class GenerateResponse(BaseModel):
|
|||||||
|
|
||||||
@app.get("/health")
|
||||||
def health():
|
||||||
return {"ok": True}
|
return {
|
||||||
|
"ok": True,
|
||||||
|
"nats": {
|
||||||
|
"enabled": bool(nats_consumer.NATS_URL),
|
||||||
|
"consumers": nats_consumer.consumer_health,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
_RETRY_SUFFIX = (
|
||||||
|
|||||||
146
ml/serving/nats_consumer.py
Normal file
@@ -0,0 +1,146 @@
|
|||||||
|
"""
|
||||||
|
JetStream durable consumers for ml/serving.
|
||||||
|
|
||||||
|
Streams:
|
||||||
|
signals (subjects: signals.>) — durable: {prefix}-signals
|
||||||
|
feedback (subjects: feedback.>) — durable: {prefix}-feedback
|
||||||
|
|
||||||
|
Handled subjects:
|
||||||
|
signals.task.synced → write per-user sync metadata to STATE_DIR
|
||||||
|
signals.tip.feedback → log for observability (reward is applied via HTTP path)
|
||||||
|
|
||||||
|
Config (env vars):
|
||||||
|
NATS_URL — broker URL; empty = consumers disabled (default: "")
|
||||||
|
NATS_DURABLE_PREFIX — prefix for durable consumer names (default: "feature-pipeline")
|
||||||
|
NATS_MAX_DELIVER — max redelivery attempts before dropping (default: 5)
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
import structlog
|
||||||
|
from schemas import TaskSyncedPayload, TipFeedbackPayload
|
||||||
|
|
||||||
|
log = structlog.get_logger(__name__)
|
||||||
|
|
||||||
|
NATS_URL = os.getenv("NATS_URL", "")
|
||||||
|
NATS_DURABLE_PREFIX = os.getenv("NATS_DURABLE_PREFIX", "feature-pipeline")
|
||||||
|
NATS_MAX_DELIVER = int(os.getenv("NATS_MAX_DELIVER", "5"))
|
||||||
|
|
||||||
|
# Exposed to /health
|
||||||
|
consumer_health: dict[str, dict] = {
|
||||||
|
"signals": {"last_msg_ts": None, "processed": 0, "errors": 0},
|
||||||
|
"feedback": {"last_msg_ts": None, "processed": 0, "errors": 0},
|
||||||
|
}
|
||||||
|
|
||||||
|
_nc = None # nats.aio.Client
|
||||||
|
_subs: list = [] # active JetStream subscriptions
|
||||||
|
|
||||||
|
|
||||||
|
# ── Subject handlers ───────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def _sync_meta_path(state_dir: Path, user_id: str) -> Path:
|
||||||
|
safe = "".join(c if c.isalnum() else "_" for c in user_id)
|
||||||
|
return state_dir / f"{safe}_sync.json"
|
||||||
|
|
||||||
|
|
||||||
|
async def _handle(subject: str, payload: dict, state_dir: Path) -> None:
|
||||||
|
if subject == "signals.task.synced":
|
||||||
|
msg = TaskSyncedPayload.model_validate(payload)
|
||||||
|
p = _sync_meta_path(state_dir, msg.userId)
|
||||||
|
p.write_text(json.dumps({
|
||||||
|
"last_sync_ts": msg.syncedAt,
|
||||||
|
"task_count": msg.count,
|
||||||
|
}))
|
||||||
|
log.info("nats: task_synced", user_id=msg.userId, count=msg.count)
|
||||||
|
elif subject == "signals.tip.feedback":
|
||||||
|
msg = TipFeedbackPayload.model_validate(payload)
|
||||||
|
log.info("nats: tip_feedback", user_id=msg.userId, tip_id=msg.tipId, action=msg.action, reward=msg.reward)
|
||||||
|
else:
|
||||||
|
log.debug("nats: unhandled subject", subject=subject)
|
||||||
|
|
||||||
|
|
||||||
|
# ── Consumer factory ───────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def _make_handler(key: str, state_dir: Path):
|
||||||
|
"""Return an async push-consumer callback that acks on success, naks on error."""
|
||||||
|
async def handler(msg) -> None:
|
||||||
|
consumer_health[key]["last_msg_ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||||
|
try:
|
||||||
|
payload = json.loads(msg.data)
|
||||||
|
await _handle(msg.subject, payload, state_dir)
|
||||||
|
await msg.ack()
|
||||||
|
consumer_health[key]["processed"] += 1
|
||||||
|
except Exception as exc:
|
||||||
|
consumer_health[key]["errors"] += 1
|
||||||
|
log.warning("nats: processing error", key=key, subject=msg.subject, exc=str(exc))
|
||||||
|
await msg.nak()
|
||||||
|
return handler
|
||||||
|
|
||||||
|
|
||||||
|
# ── Lifecycle ──────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async def start(state_dir: Path) -> None:
|
||||||
|
"""Connect to NATS and register durable push consumers. No-op if NATS_URL is unset."""
|
||||||
|
global _nc
|
||||||
|
if not NATS_URL:
|
||||||
|
log.info("nats: NATS_URL unset — JetStream consumers disabled")
|
||||||
|
return
|
||||||
|
|
||||||
|
try:
|
||||||
|
import nats as nats_lib
|
||||||
|
from nats.js.api import ConsumerConfig, AckPolicy
|
||||||
|
|
||||||
|
_nc = await nats_lib.connect(
|
||||||
|
NATS_URL,
|
||||||
|
name="ml-serving",
|
||||||
|
reconnect_time_wait=5,
|
||||||
|
max_reconnect_attempts=-1,
|
||||||
|
)
|
||||||
|
js = _nc.jetstream()
|
||||||
|
log.info("nats: connected", url=NATS_URL)
|
||||||
|
except Exception as exc:
|
||||||
|
log.warning("nats: connection failed — consumers disabled", exc=str(exc))
|
||||||
|
_nc = None
|
||||||
|
return
|
||||||
|
|
||||||
|
config = ConsumerConfig(
|
||||||
|
ack_policy=AckPolicy.EXPLICIT,
|
||||||
|
max_deliver=NATS_MAX_DELIVER,
|
||||||
|
)
|
||||||
|
|
||||||
|
for key, subject in [("signals", "signals.>"), ("feedback", "feedback.>")]:
|
||||||
|
durable = f"{NATS_DURABLE_PREFIX}-{key}"
|
||||||
|
try:
|
||||||
|
sub = await js.subscribe(
|
||||||
|
subject,
|
||||||
|
durable=durable,
|
||||||
|
cb=_make_handler(key, state_dir),
|
||||||
|
config=config,
|
||||||
|
)
|
||||||
|
_subs.append(sub)
|
||||||
|
log.info("nats: subscribed", subject=subject, durable=durable)
|
||||||
|
except Exception as exc:
|
||||||
|
log.warning("nats: subscribe failed", key=key, exc=str(exc))
|
||||||
|
|
||||||
|
|
||||||
|
async def stop() -> None:
|
||||||
|
"""Drain subscriptions and close NATS connection."""
|
||||||
|
global _nc
|
||||||
|
for sub in _subs:
|
||||||
|
try:
|
||||||
|
await sub.unsubscribe()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
_subs.clear()
|
||||||
|
if _nc:
|
||||||
|
try:
|
||||||
|
await _nc.drain()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
_nc = None
|
||||||
|
log.info("nats: disconnected")
|
||||||
@@ -4,3 +4,6 @@ pydantic==2.10.4
|
|||||||
numpy>=1.26.0
|
||||||
httpx>=0.27.0
|
||||||
anthropic>=0.40.0
|
||||||
|
nats-py>=2.9.0
|
||||||
|
structlog>=24.1.0
|
||||||
|
sentry-sdk>=2.0.0
|
||||||
|
|||||||
50
ml/serving/schemas.py
Normal file
@@ -0,0 +1,50 @@
|
|||||||
|
"""
|
||||||
|
Pydantic models mirroring oo.events.v1 proto schemas.
|
||||||
|
|
||||||
|
Field names use camelCase to match the proto3 JSON mapping convention
|
||||||
|
and the TypeScript payload shapes published by services/api.
|
||||||
|
|
||||||
|
Keep in sync with packages/shared-types/events/oo/events/v1/.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from typing import Literal, Optional
|
||||||
|
from pydantic import BaseModel
|
||||||
|
|
||||||
|
|
||||||
|
class TaskSyncedPayload(BaseModel):
|
||||||
|
userId: str
|
||||||
|
source: str
|
||||||
|
count: int
|
||||||
|
syncedAt: str
|
||||||
|
|
||||||
|
|
||||||
|
class TipServedPayload(BaseModel):
|
||||||
|
userId: str
|
||||||
|
tipId: str
|
||||||
|
policy: str
|
||||||
|
servedAt: str
|
||||||
|
|
||||||
|
|
||||||
|
class TipFeedbackPayload(BaseModel):
|
||||||
|
userId: str
|
||||||
|
tipId: str
|
||||||
|
action: Literal['done', 'dismiss', 'snooze', 'helpful', 'not_helpful']
|
||||||
|
reward: float
|
||||||
|
dwellMs: Optional[int] = None
|
||||||
|
createdAt: str
|
||||||
|
|
||||||
|
|
||||||
|
class TipRewardFailedPayload(BaseModel):
|
||||||
|
userId: str
|
||||||
|
tipId: str
|
||||||
|
reward: float
|
||||||
|
attempts: int
|
||||||
|
error: str
|
||||||
|
failedAt: str
|
||||||
|
|
||||||
|
|
||||||
|
class IntegrationTokenExpiredPayload(BaseModel):
|
||||||
|
userId: str
|
||||||
|
provider: str
|
||||||
|
detectedAt: str
|
||||||
169
ml/serving/tests/test_schemas_and_consumer.py
Normal file
@@ -0,0 +1,169 @@
|
|||||||
|
"""
|
||||||
|
Tests for schemas.py and nats_consumer._handle.
|
||||||
|
"""
|
||||||
|
import json
|
||||||
|
import pytest
|
||||||
|
import tempfile
|
||||||
|
from pathlib import Path
|
||||||
|
from pydantic import ValidationError
|
||||||
|
from unittest.mock import AsyncMock
|
||||||
|
|
||||||
|
from schemas import (
|
||||||
|
TaskSyncedPayload,
|
||||||
|
TipServedPayload,
|
||||||
|
TipFeedbackPayload,
|
||||||
|
TipRewardFailedPayload,
|
||||||
|
IntegrationTokenExpiredPayload,
|
||||||
|
)
|
||||||
|
from nats_consumer import _handle, _sync_meta_path
|
||||||
|
|
||||||
|
|
||||||
|
# ── Schema validation ─────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
class TestTaskSyncedPayload:
|
||||||
|
def test_valid(self):
|
||||||
|
p = TaskSyncedPayload.model_validate(
|
||||||
|
{"userId": "u1", "source": "todoist", "count": 5, "syncedAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
assert p.userId == "u1"
|
||||||
|
assert p.count == 5
|
||||||
|
|
||||||
|
def test_missing_field_raises(self):
|
||||||
|
with pytest.raises(ValidationError):
|
||||||
|
TaskSyncedPayload.model_validate({"userId": "u1", "source": "todoist"})
|
||||||
|
|
||||||
|
def test_wrong_type_raises(self):
|
||||||
|
with pytest.raises(ValidationError):
|
||||||
|
TaskSyncedPayload.model_validate(
|
||||||
|
{"userId": "u1", "source": "todoist", "count": "not-an-int", "syncedAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestTipFeedbackPayload:
|
||||||
|
def test_valid_without_dwell(self):
|
||||||
|
p = TipFeedbackPayload.model_validate(
|
||||||
|
{"userId": "u1", "tipId": "t1", "action": "done", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
assert p.dwellMs is None
|
||||||
|
|
||||||
|
def test_valid_with_dwell(self):
|
||||||
|
p = TipFeedbackPayload.model_validate(
|
||||||
|
{"userId": "u1", "tipId": "t1", "action": "helpful", "reward": 0.5,
|
||||||
|
"dwellMs": 3200, "createdAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
assert p.dwellMs == 3200
|
||||||
|
|
||||||
|
def test_invalid_action_raises(self):
|
||||||
|
with pytest.raises(ValidationError):
|
||||||
|
TipFeedbackPayload.model_validate(
|
||||||
|
{"userId": "u1", "tipId": "t1", "action": "like", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_all_valid_actions(self):
|
||||||
|
for action in ("done", "dismiss", "snooze", "helpful", "not_helpful"):
|
||||||
|
p = TipFeedbackPayload.model_validate(
|
||||||
|
{"userId": "u1", "tipId": "t1", "action": action, "reward": 0.0, "createdAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
assert p.action == action
|
||||||
|
|
||||||
|
|
||||||
|
class TestOtherPayloads:
|
||||||
|
def test_tip_served(self):
|
||||||
|
p = TipServedPayload.model_validate(
|
||||||
|
{"userId": "u1", "tipId": "t1", "policy": "egreedy-v2", "servedAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
assert p.policy == "egreedy-v2"
|
||||||
|
|
||||||
|
def test_tip_reward_failed(self):
|
||||||
|
p = TipRewardFailedPayload.model_validate(
|
||||||
|
{"userId": "u1", "tipId": "t1", "reward": 1.0, "attempts": 3,
|
||||||
|
"error": "timeout", "failedAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
assert p.attempts == 3
|
||||||
|
|
||||||
|
def test_integration_token_expired(self):
|
||||||
|
p = IntegrationTokenExpiredPayload.model_validate(
|
||||||
|
{"userId": "u1", "provider": "todoist", "detectedAt": "2026-04-25T10:00:00Z"}
|
||||||
|
)
|
||||||
|
assert p.provider == "todoist"
|
||||||
|
|
||||||
|
|
||||||
|
# ── _handle behaviour ─────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
TASK_SYNCED = {
|
||||||
|
"userId": "user-abc",
|
||||||
|
"source": "todoist",
|
||||||
|
"count": 7,
|
||||||
|
"syncedAt": "2026-04-25T10:00:00Z",
|
||||||
|
}
|
||||||
|
|
||||||
|
TIP_FEEDBACK = {
|
||||||
|
"userId": "user-abc",
|
||||||
|
"tipId": "tip-xyz",
|
||||||
|
"action": "done",
|
||||||
|
"reward": 1.0,
|
||||||
|
"dwellMs": 4200,
|
||||||
|
"createdAt": "2026-04-25T10:00:00Z",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class TestHandle:
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_task_synced_writes_meta_file(self):
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
state_dir = Path(tmp)
|
||||||
|
await _handle("signals.task.synced", TASK_SYNCED, state_dir)
|
||||||
|
meta_path = _sync_meta_path(state_dir, "user-abc")
|
||||||
|
assert meta_path.exists()
|
||||||
|
data = json.loads(meta_path.read_text())
|
||||||
|
assert data["task_count"] == 7
|
||||||
|
assert data["last_sync_ts"] == "2026-04-25T10:00:00Z"
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_task_synced_bad_payload_raises(self):
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
with pytest.raises(ValidationError):
|
||||||
|
await _handle("signals.task.synced", {"userId": "u1"}, Path(tmp))
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_tip_feedback_valid_does_not_raise(self):
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
# should log and return cleanly
|
||||||
|
await _handle("signals.tip.feedback", TIP_FEEDBACK, Path(tmp))
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_tip_feedback_bad_action_raises(self):
|
||||||
|
bad = {**TIP_FEEDBACK, "action": "unknown"}
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
with pytest.raises(ValidationError):
|
||||||
|
await _handle("signals.tip.feedback", bad, Path(tmp))
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_unhandled_subject_is_ignored(self):
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
# should not raise for unknown subjects
|
||||||
|
await _handle("signals.something.new", {"any": "data"}, Path(tmp))
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_make_handler_acks_on_success(self):
|
||||||
|
from nats_consumer import _make_handler
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
handler = _make_handler("signals", Path(tmp))
|
||||||
|
msg = AsyncMock()
|
||||||
|
msg.subject = "signals.task.synced"
|
||||||
|
msg.data = json.dumps(TASK_SYNCED).encode()
|
||||||
|
await handler(msg)
|
||||||
|
msg.ack.assert_awaited_once()
|
||||||
|
msg.nak.assert_not_awaited()
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_make_handler_naks_on_validation_error(self):
|
||||||
|
from nats_consumer import _make_handler
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
handler = _make_handler("signals", Path(tmp))
|
||||||
|
msg = AsyncMock()
|
||||||
|
msg.subject = "signals.task.synced"
|
||||||
|
msg.data = json.dumps({"userId": "u1"}).encode() # missing fields
|
||||||
|
await handler(msg)
|
||||||
|
msg.nak.assert_awaited_once()
|
||||||
|
msg.ack.assert_not_awaited()
|
||||||
63
packages/shared-types/README.md
Normal file
@@ -0,0 +1,63 @@
# @oo/shared-types

Canonical contracts for all inter-module communication. Two surfaces:

| Surface | Format | Location |
|---------|--------|----------|
| HTTP (sync) | OpenAPI / TypeScript interfaces | `src/http/` |
| Events (async) | Protocol Buffers + TS interfaces | `src/events/`, `events/` |

## HTTP types

Hand-written TypeScript interfaces that mirror the OpenAPI specs. Imported by
`services/api` and `apps/web`; `ml/serving` keeps hand-written Python mirrors.

| File | Types |
|------|-------|
| `src/http/tip.ts` | `TipCandidate`, `RecommendResponse`, `TipFeedback` |
| `src/http/auth.ts` | `SessionUser` |
| `src/http/integrations.ts` | `IntegrationsResponse`, `Integration` |
| `src/http/user.ts` | `UserProfile` |
| `src/http/signal.ts` | `Signal`, `SignalSource` |

## Event types

Protobuf schemas live in `events/oo/events/v1/`. TypeScript interfaces in
`src/events/index.ts` mirror the proto envelope and payload types.

| Proto file | Messages |
|------------|----------|
| `envelope.proto` | `Envelope` (wraps every event) |
| `signals.proto` | `TaskSyncedPayload`, `TipServedPayload`, `TipFeedbackPayload`, `TipRewardFailedPayload` |
| `integration.proto` | `IntegrationTokenExpiredPayload` |

**Schema evolution rules (ADR-0005):**
- Additive changes only within a version (new fields, new message types).
- Removed fields must be marked `reserved` — never reuse a field number.
- Breaking changes require a new package version (`oo.events.v2`) and a `schemaVersion` bump in the envelope.
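
The versioning rule above implies that a consumer should check `schemaVersion` before trusting a payload. A minimal TypeScript sketch, assuming the envelope arrives as proto3 JSON with the camelCase field names described in `envelope.proto`. The `parseEnvelope` helper, the `EnvelopeHeader` type, and the `SUPPORTED_VERSIONS` set are hypothetical names used for illustration and are not part of this package:

```typescript
// Assumption: the versions this consumer understands. Only "v1" exists today.
const SUPPORTED_VERSIONS = new Set(["v1"]);

// Hypothetical subset of the envelope that the guard validates.
interface EnvelopeHeader {
  eventId: string;
  subject: string;
  schemaVersion: string;
}

// Parse a raw JSON message and reject envelopes with an unknown schemaVersion.
// Remaining fields (occurredAt, producer, seq, payload) are passed through as-is.
function parseEnvelope(raw: string): EnvelopeHeader & Record<string, unknown> {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== "object" || value === null) {
    throw new Error("envelope must be a JSON object");
  }
  const env = value as Record<string, unknown>;
  const { eventId, subject, schemaVersion } = env;
  if (typeof eventId !== "string" || typeof subject !== "string") {
    throw new Error("envelope is missing eventId or subject");
  }
  if (typeof schemaVersion !== "string" || !SUPPORTED_VERSIONS.has(schemaVersion)) {
    throw new Error(`unsupported schemaVersion: ${String(schemaVersion)}`);
  }
  return { ...env, eventId, subject, schemaVersion };
}
```

Rejecting unknown versions up front keeps a future `v2` producer from being silently misread by a `v1` consumer; whether to drop, dead-letter, or retry such messages is a separate design choice.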

## Schema registry / CI gate

`buf` enforces lint and breaking-change detection on every PR that touches `events/`:

```bash
# Lint
buf lint events/

# Breaking-change check against main
buf breaking events/ --against '.git#branch=main,subdir=packages/shared-types/events'
```

Local shortcut: `./scripts/buf-check.sh`

CI: `.gitea/workflows/buf-check.yaml` (requires a Gitea Actions runner).

Install buf: `curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf`

## Contract

`/health` — not applicable (library package, no process).

**Extraction criteria** — always a shared library. Extract to a separate registry
service only when schema governance requires independent versioning and deployment
(e.g. external consumers, SLA divergence from the monorepo).
7
packages/shared-types/events/buf.yaml
Normal file
@@ -0,0 +1,7 @@
version: v1
lint:
  use:
    - STANDARD
breaking:
  use:
    - FILE
25
packages/shared-types/events/oo/events/v1/envelope.proto
Normal file
@@ -0,0 +1,25 @@
syntax = "proto3";
package oo.events.v1;

import "oo/events/v1/signals.proto";
import "oo/events/v1/integration.proto";

// Envelope wraps every event on the bus and on NATS JetStream.
// Wire format: proto3 JSON (camelCase field names).
// schema_version = "v1" — bump to "v2" only for breaking payload changes.
message Envelope {
  string event_id = 1; // UUID assigned by bus on publish
  string occurred_at = 2; // ISO 8601
  string schema_version = 3; // "v1"
  string producer = 4; // e.g. "services/api"
  string subject = 5; // NATS-style subject: domain.entity.verb
  uint64 seq = 6; // monotonic sequence from the bus ring

  oneof payload {
    TaskSyncedPayload task_synced = 10;
    TipServedPayload tip_served = 11;
    TipFeedbackPayload tip_feedback = 12;
    TipRewardFailedPayload tip_reward_failed = 13;
    IntegrationTokenExpiredPayload integration_token_expired = 14;
  }
}
@@ -0,0 +1,9 @@
syntax = "proto3";
package oo.events.v1;

// subject: signals.integration.token_expired
message IntegrationTokenExpiredPayload {
  string user_id = 1;
  string provider = 2;
  string detected_at = 3; // ISO 8601
}
39
packages/shared-types/events/oo/events/v1/signals.proto
Normal file
@@ -0,0 +1,39 @@
syntax = "proto3";
package oo.events.v1;

// subject: signals.task.synced
message TaskSyncedPayload {
  string user_id = 1;
  string source = 2; // e.g. "todoist"
  int32 count = 3;
  string synced_at = 4; // ISO 8601
}

// subject: signals.tip.served
message TipServedPayload {
  string user_id = 1;
  string tip_id = 2;
  string policy = 3;
  string served_at = 4; // ISO 8601
}

// subject: signals.tip.feedback
// action: done | dismiss | snooze | helpful | not_helpful
message TipFeedbackPayload {
  string user_id = 1;
  string tip_id = 2;
  string action = 3;
  double reward = 4;
  optional int64 dwell_ms = 5; // null when no dwell was recorded
  string created_at = 6; // ISO 8601
}

// subject: signals.tip.reward_failed
message TipRewardFailedPayload {
  string user_id = 1;
  string tip_id = 2;
  double reward = 3;
  int32 attempts = 4;
  string error = 5;
  string failed_at = 6; // ISO 8601
}
@@ -15,7 +15,9 @@
|
|||||||
"test": "vitest run",
|
"test": "vitest run",
|
||||||
"test:watch": "vitest",
|
"test:watch": "vitest",
|
||||||
"type-check": "tsc --noEmit",
|
"type-check": "tsc --noEmit",
|
||||||
"clean": "rm -rf dist"
|
"clean": "rm -rf dist",
|
||||||
|
"buf:lint": "buf lint events",
|
||||||
|
"buf:breaking": "buf breaking events --against '.git#branch=main,subdir=packages/shared-types/events'"
|
||||||
},
|
},
|
||||||
"devDependencies": {
|
"devDependencies": {
|
||||||
"@vitest/coverage-v8": "^4.1.4",
|
"@vitest/coverage-v8": "^4.1.4",
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
/**
|
/**
|
||||||
* NormalizedEvent — the durable envelope for all events flowing through
|
* NormalizedEvent — the durable envelope for all events flowing through
|
||||||
* the system. Today: in-process EventEmitter. Tomorrow: NATS JetStream.
|
* the system. Mirrors oo.events.v1.Envelope in packages/shared-types/events/.
|
||||||
*
|
*
|
||||||
* Subject taxonomy:
|
* Subject taxonomy:
|
||||||
* signals.task.synced — Todoist (or other source) task list refreshed
|
* signals.task.synced — Todoist (or other source) task list refreshed
|
||||||
@@ -10,10 +10,16 @@
|
|||||||
* signals.integration.token_expired — OAuth token needs reconnect
|
* signals.integration.token_expired — OAuth token needs reconnect
|
||||||
*/
|
*/
|
||||||
export interface NormalizedEvent<T = unknown> {
|
export interface NormalizedEvent<T = unknown> {
|
||||||
|
/** UUID assigned by bus on publish */
|
||||||
|
eventId: string;
|
||||||
/** NATS-style subject: domain.entity.verb */
|
/** NATS-style subject: domain.entity.verb */
|
||||||
subject: string;
|
subject: string;
|
||||||
/** ISO 8601 timestamp */
|
/** ISO 8601 timestamp */
|
||||||
ts: string;
|
occurredAt: string;
|
||||||
|
/** "v1" — bump for breaking payload changes; see packages/shared-types/events/ */
|
||||||
|
schemaVersion: 'v1';
|
||||||
|
/** e.g. "services/api" */
|
||||||
|
producer: string;
|
||||||
/** Monotonically increasing sequence number (in-process ring; JetStream seq in prod) */
|
/** Monotonically increasing sequence number (in-process ring; JetStream seq in prod) */
|
||||||
seq: number;
|
seq: number;
|
||||||
payload: T;
|
payload: T;
|
||||||
|
|||||||
@@ -4,5 +4,6 @@
|
|||||||
"outDir": "dist",
|
"outDir": "dist",
|
||||||
"rootDir": "src"
|
"rootDir": "src"
|
||||||
},
|
},
|
||||||
"include": ["src"]
|
"include": ["src"],
|
||||||
|
"exclude": ["src/__tests__", "**/*.test.ts"]
|
||||||
}
|
}
|
||||||
|
|||||||
877
pnpm-lock.yaml
generated
File diff suppressed because it is too large
24
scripts/buf-check.sh
Executable file
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Run buf lint and breaking-change detection locally.
# Usage: ./scripts/buf-check.sh [against-branch]
# Default against-branch: main
set -euo pipefail

AGAINST="${1:-main}"
ROOT="$(cd "$(dirname "$0")/.." && pwd)"
EVENTS="$ROOT/packages/shared-types/events"

if ! command -v buf &>/dev/null; then
  echo "buf not found. Install: https://buf.build/docs/installation"
  echo " curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf"
  exit 1
fi

echo "==> buf lint"
buf lint "$EVENTS"

echo "==> buf breaking against $AGAINST"
buf breaking "$EVENTS" \
  --against ".git#branch=${AGAINST},subdir=packages/shared-types/events"

echo "All checks passed."
91
services/api/README.md
Normal file
@@ -0,0 +1,91 @@
# services/api

Express BFF that serves all client-facing routes, manages sessions, runs background signal sync, and proxies admin calls to `ml/serving`.

## Contract

```
GET    /health                              { ok: true }

POST   /api/auth/login                      → redirect to Google OAuth
GET    /api/auth/callback                   OAuth return URL
POST   /api/auth/logout
GET    /api/auth/session                    → { user? }
POST   /api/auth/token                      { token } → set sid cookie (ADMIN_TOKEN auth)

GET    /api/integrations                    list connected integrations
POST   /api/integrations/todoist/connect    start Todoist OAuth
GET    /api/integrations/todoist/callback
DELETE /api/integrations/:provider          disconnect

POST   /api/recommend                       → { tip }
POST   /api/tip/:id/feedback                { action } → { ok }

GET    /api/user/profile
DELETE /api/user                            account deletion

POST   /api/push/subscribe
DELETE /api/push/subscribe

GET    /api/admin/stats                     DAU/WAU, feedback breakdown
GET    /api/admin/users
GET    /api/admin/events                    recent event stream (ring buffer)
GET    /api/admin/sim/runs                  offline sim run list
POST   /api/admin/sim/run                   launch offline sim
GET    /api/admin/sim/runs/:id/output       tail sim stdout
...

GET    /api/ml/*                            admin-only proxy to ml/serving
```

## Middleware stack (request order)

1. `cors` — origin limited to `WEB_BASE_URL`
2. `tracingMiddleware` — reads or generates W3C `traceparent`; sets `req.traceId` + `req.traceparent`
3. `pinoHttp` — structured JSON request/response logs with `traceId` field; `/health` suppressed
4. `express.json()` / `cookieParser`
5. `sessionMiddleware` — validates `sid` cookie, attaches `req.userId`

## Observability

Logs are structured JSON via **pino**. Every line includes `traceId` (extracted from the incoming W3C `traceparent` header, or generated fresh). The same `traceparent` is forwarded on all outbound HTTP calls to `ml/serving` so traces correlate end-to-end.

Sentry error capture is active when `SENTRY_DSN` is set.
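
The read-or-generate behaviour described for `traceparent` can be sketched as follows. This is a minimal illustration and not the repository's `tracingMiddleware`: `resolveTrace`, `randomHex`, `TraceContext`, and `TRACEPARENT_RE` are hypothetical names, and only the header layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) comes from the W3C Trace Context format.

```typescript
// W3C Trace Context header: version-traceId-parentId-flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Random lowercase hex of the given length. Math.random keeps the sketch
// dependency-free; a real service would likely use a crypto-grade source.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

interface TraceContext {
  traceId: string;
  traceparent: string;
}

// Reuse a valid incoming header, otherwise start a new trace.
function resolveTrace(header: string | undefined): TraceContext {
  const match = header ? TRACEPARENT_RE.exec(header.trim()) : null;
  const valid =
    match !== null &&
    match[1] !== "ff" && // version ff is reserved as invalid
    !/^0+$/.test(match[2]) && // all-zero trace ID is invalid
    !/^0+$/.test(match[3]); // all-zero parent ID is invalid
  if (valid && match) {
    return { traceId: match[2], traceparent: match[0] };
  }
  const traceId = randomHex(32);
  // Flags 01 marks the new trace as sampled (an assumption for this sketch).
  return { traceId, traceparent: `00-${traceId}-${randomHex(16)}-01` };
}
```

A middleware built on this would attach the two fields to the request and pass `traceparent` unchanged as a header on outbound calls, which is what lets log lines from both services share one `traceId`.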

## Background tasks

- **Todoist sync scheduler** — runs every `TODOIST_SYNC_INTERVAL_MS` (default 15 min); starts 10 s after boot to avoid startup surge.
- **Retention purge** — deletes `tipScores` and `tipFeedback` rows older than 30 days; runs on boot and daily.
- **Profile TTL invalidation** — listens to `signals.task.synced` and `signals.tip.feedback` on the in-process Bus; invalidates cached user-level profile features so the next `/recommend` gets fresh values.
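
The 30-day retention window above comes down to computing a cutoff timestamp and deleting rows older than it. A minimal sketch, assuming timestamps are stored as ISO 8601 UTC strings; `retentionCutoff`, `isExpired`, and the two constants are hypothetical names for illustration, not the service's actual purge code:

```typescript
const RETENTION_DAYS = 30; // matches the 30-day window described above
const DAY_MS = 24 * 60 * 60 * 1000;

// ISO 8601 timestamp marking the oldest row that should be kept.
function retentionCutoff(now: Date, days: number = RETENTION_DAYS): string {
  return new Date(now.getTime() - days * DAY_MS).toISOString();
}

// A row is purged when its timestamp sorts before the cutoff. ISO 8601 UTC
// strings in the same format compare correctly as plain strings.
function isExpired(rowTimestamp: string, cutoff: string): boolean {
  return rowTimestamp < cutoff;
}
```

String comparison only works when every stored timestamp uses the same UTC format and precision; mixed offsets or formats would need parsing into `Date` values first.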

## Config

| Env var | Default | Description |
|---------|---------|-------------|
| `PORT` | `3001` | Listen port |
| `NODE_ENV` | `development` | Environment label |
| `DATABASE_PATH` | `./data/oo.db` | SQLite file |
| `SESSION_SECRET` | required | Cookie signing secret |
| `GOOGLE_CLIENT_ID/SECRET` | required | OAuth |
| `TODOIST_CLIENT_ID/SECRET` | required | OAuth |
| `API_BASE_URL` | `http://localhost:3001` | Self-referential redirect URI |
| `WEB_BASE_URL` | `http://localhost:3000` | CORS + post-login redirect |
| `ML_SERVING_URL` | `http://localhost:8000` | ml/serving base URL |
| `NATS_URL` | `` | NATS broker; empty = in-process bus only |
| `TODOIST_SYNC_INTERVAL_MS` | `900000` | Background sync cadence |
| `TIP_PROMPT_VERSION` | `` | Prompt variant(s) for `/generate` |
| `LOG_LEVEL` | `info` | pino log level |
| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
| `VAPID_*` | | Web push keys |
| `ADMIN_TOKEN` | `` | Static token for service/Playwright admin auth; empty = disabled |
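
The table distinguishes variables with defaults from required ones. A minimal sketch of how such lookups might be written; the helpers here take the environment as an explicit argument so they can be exercised without `process.env`, which differs from the two-argument `optional(name, fallback)` calls in `config.ts`. The names `Env` and `required`, and the choice to treat an empty string as unset, are assumptions for illustration:

```typescript
type Env = Record<string, string | undefined>;

// Return the variable's value, or the fallback when it is unset or empty.
function optional(env: Env, name: string, fallback: string): string {
  const value = env[name];
  return value === undefined || value === "" ? fallback : value;
}

// Return the variable's value, or throw so the process fails fast at boot.
function required(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`missing required env var: ${name}`);
  }
  return value;
}

// Example with a literal environment instead of process.env.
const exampleEnv: Env = { PORT: "8080" };
const port = optional(exampleEnv, "PORT", "3001");
const logLevel = optional(exampleEnv, "LOG_LEVEL", "info");
```

Failing at startup for a missing `SESSION_SECRET` is usually preferable to discovering it on the first login request.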

## Health story

`GET /health` returns `{ ok: true }`. No dependency checks — upstream deps (`ml/serving`, NATS) have their own health endpoints checked separately.

## Extraction criteria

Extract to its own host when:
- Auth session management needs a dedicated Redis/PG session store, **or**
- Background sync load (Todoist, future connectors) displaces API serving on the shared host, **or**
- Team boundary emerges between auth/BFF and recommender orchestration.
@@ -16,6 +16,7 @@
|
|||||||
},
|
},
|
||||||
"dependencies": {
|
"dependencies": {
|
||||||
"@oo/shared-types": "workspace:*",
|
"@oo/shared-types": "workspace:*",
|
||||||
|
"@sentry/node": "^10.50.0",
|
||||||
"better-sqlite3": "^11.8.1",
|
"better-sqlite3": "^11.8.1",
|
||||||
"cookie-parser": "^1.4.7",
|
"cookie-parser": "^1.4.7",
|
||||||
"cors": "^2.8.5",
|
"cors": "^2.8.5",
|
||||||
@@ -27,6 +28,8 @@
|
|||||||
"nats": "^2.29.3",
|
"nats": "^2.29.3",
|
||||||
"node-fetch": "^3.3.2",
|
"node-fetch": "^3.3.2",
|
||||||
"openid-client": "^6.3.4",
|
"openid-client": "^6.3.4",
|
||||||
|
"pino": "^10.3.1",
|
||||||
|
"pino-http": "^11.0.0",
|
||||||
"web-push": "^3.6.7",
|
"web-push": "^3.6.7",
|
||||||
"zod": "^3.24.1"
|
"zod": "^3.24.1"
|
||||||
},
|
},
|
||||||
|
|||||||
@@ -34,6 +34,17 @@ export const config = {
|
|||||||
ML_SERVING_URL: optional('ML_SERVING_URL', 'http://localhost:8000'),
|
ML_SERVING_URL: optional('ML_SERVING_URL', 'http://localhost:8000'),
|
||||||
LITELLM_URL: optional('LITELLM_URL', 'http://localhost:4000'),
|
LITELLM_URL: optional('LITELLM_URL', 'http://localhost:4000'),
|
||||||
|
|
||||||
|
MLFLOW_URL: optional('MLFLOW_URL', 'http://localhost:5000'),
|
||||||
|
AIRFLOW_URL: optional('AIRFLOW_URL', 'http://localhost:8080'),
|
||||||
|
AIRFLOW_API_USER: optional('AIRFLOW_API_USER', 'admin'),
|
||||||
|
AIRFLOW_API_PASSWORD: optional('AIRFLOW_API_PASSWORD', 'admin'),
|
||||||
|
|
||||||
|
/** Shared secret for internal Airflow→API callbacks. */
|
||||||
|
INTERNAL_API_TOKEN: optional('INTERNAL_API_TOKEN', ''),
|
||||||
|
|
||||||
|
/** Static token for automated/service access to the admin panel (e.g. Playwright tests). */
|
||||||
|
ADMIN_TOKEN: optional('ADMIN_TOKEN', ''),
|
||||||
|
|
||||||
VAPID_PUBLIC_KEY: optional('VAPID_PUBLIC_KEY', ''),
|
VAPID_PUBLIC_KEY: optional('VAPID_PUBLIC_KEY', ''),
|
||||||
VAPID_PRIVATE_KEY: optional('VAPID_PRIVATE_KEY', ''),
|
VAPID_PRIVATE_KEY: optional('VAPID_PRIVATE_KEY', ''),
|
||||||
VAPID_SUBJECT: optional('VAPID_SUBJECT', 'mailto:admin@localhost'),
|
VAPID_SUBJECT: optional('VAPID_SUBJECT', 'mailto:admin@localhost'),
|
||||||
|
|||||||
@@ -156,6 +156,10 @@ export function runMigrations() {
|
|||||||
`ALTER TABLE tip_scores ADD COLUMN prompt_version TEXT`,
|
`ALTER TABLE tip_scores ADD COLUMN prompt_version TEXT`,
|
||||||
`ALTER TABLE tip_scores ADD COLUMN llm_model TEXT`,
|
`ALTER TABLE tip_scores ADD COLUMN llm_model TEXT`,
|
||||||
`ALTER TABLE tip_scores ADD COLUMN tip_kind TEXT`,
|
`ALTER TABLE tip_scores ADD COLUMN tip_kind TEXT`,
|
||||||
|
`ALTER TABLE sim_runs ADD COLUMN airflow_dag_run_id TEXT`,
|
||||||
|
`ALTER TABLE sim_runs ADD COLUMN mlflow_run_id TEXT`,
|
||||||
|
`ALTER TABLE sim_runs ADD COLUMN judge_mode TEXT NOT NULL DEFAULT 'rule'`,
|
||||||
|
`ALTER TABLE sim_runs ADD COLUMN n_policies INTEGER NOT NULL DEFAULT 2`,
|
||||||
]) {
|
]) {
|
||||||
try { sqlite.exec(stmt); } catch { /* column already exists */ }
|
try { sqlite.exec(stmt); } catch { /* column already exists */ }
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -112,9 +112,13 @@ export const simRuns = sqliteTable('sim_runs', {
|
|||||||
tasksPerRound: integer('tasks_per_round').notNull().default(8),
|
tasksPerRound: integer('tasks_per_round').notNull().default(8),
|
||||||
useLlm: integer('use_llm', { mode: 'boolean' }).notNull().default(false),
|
useLlm: integer('use_llm', { mode: 'boolean' }).notNull().default(false),
|
||||||
status: text('status').notNull().default('pending'), // 'pending'|'running'|'done'|'failed'
|
status: text('status').notNull().default('pending'), // 'pending'|'running'|'done'|'failed'
|
||||||
|
judgeMode: text('judge_mode').notNull().default('rule'),
|
||||||
|
nPolicies: integer('n_policies').notNull().default(2),
|
||||||
summaryJson: text('summary_json'), // JSON: { [policy]: PolicySummary }
|
summaryJson: text('summary_json'), // JSON: { [policy]: PolicySummary }
|
||||||
winner: text('winner'),
|
winner: text('winner'),
|
||||||
personaBreakdownJson: text('persona_breakdown_json'), // JSON: { [persona]: { [policy]: {reward,n} } }
|
personaBreakdownJson: text('persona_breakdown_json'), // JSON: { [persona]: { [policy]: {reward,n} } }
|
||||||
|
airflowDagRunId: text('airflow_dag_run_id'),
|
||||||
|
mlflowRunId: text('mlflow_run_id'),
|
||||||
createdAt: text('created_at').notNull(),
|
createdAt: text('created_at').notNull(),
|
||||||
finishedAt: text('finished_at'),
|
finishedAt: text('finished_at'),
|
||||||
});
|
});
|
||||||
|
|||||||
@@ -56,7 +56,7 @@ describe('EventBus — delivery', () => {
|
|||||||
it('does not throw when publishing with no subscribers', () => {
|
it('does not throw when publishing with no subscribers', () => {
|
||||||
const b = makeBus();
|
const b = makeBus();
|
||||||
expect(() =>
|
expect(() =>
|
||||||
b.publish('signals.task.synced', { userId: 'u', count: 3, syncedAt: '' }),
|
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 3, syncedAt: '' }),
|
||||||
).not.toThrow();
|
).not.toThrow();
|
||||||
});
|
});
|
||||||
|
|
||||||
@@ -101,7 +101,7 @@ describe('EventBus — ring buffer / tail()', () => {
|
|||||||
it('tail() filters by subject prefix', () => {
|
it('tail() filters by subject prefix', () => {
|
||||||
const b = makeBus();
|
const b = makeBus();
|
||||||
b.publish('signals.tip.served', { userId: 'u', tipId: 't', policy: 'p', servedAt: '' });
|
b.publish('signals.tip.served', { userId: 'u', tipId: 't', policy: 'p', servedAt: '' });
|
||||||
b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' });
|
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
|
||||||
|
|
||||||
const tipEvents = b.tail({ subject: 'signals.tip' });
|
const tipEvents = b.tail({ subject: 'signals.tip' });
|
||||||
expect(tipEvents.every((e) => e.subject.startsWith('signals.tip'))).toBe(true);
|
expect(tipEvents.every((e) => e.subject.startsWith('signals.tip'))).toBe(true);
|
||||||
@@ -178,7 +178,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
|
|||||||
const hook = vi.fn();
|
const hook = vi.fn();
|
||||||
b.onPublish(hook);
|
b.onPublish(hook);
|
||||||
|
|
||||||
const payload = { userId: 'u', count: 2, syncedAt: 'now' };
|
const payload = { userId: 'u', source: 'todoist', count: 2, syncedAt: 'now' };
|
||||||
b.publish('signals.task.synced', payload);
|
b.publish('signals.task.synced', payload);
|
||||||
|
|
||||||
expect(hook).toHaveBeenCalledOnce();
|
expect(hook).toHaveBeenCalledOnce();
|
||||||
@@ -191,7 +191,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
|
|||||||
b.onPublish(() => calls.push('a'));
|
b.onPublish(() => calls.push('a'));
|
||||||
b.onPublish(() => calls.push('b'));
|
b.onPublish(() => calls.push('b'));
|
||||||
|
|
||||||
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' });
|
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
|
||||||
expect(calls).toEqual(['a', 'b']);
|
expect(calls).toEqual(['a', 'b']);
|
||||||
});
|
});
|
||||||
|
|
||||||
@@ -202,7 +202,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
|
|||||||
b.onPublish(hook);
|
b.onPublish(hook);
|
||||||
b.subscribe('signals.task.synced', sub);
|
b.subscribe('signals.task.synced', sub);
|
||||||
|
|
||||||
b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' });
|
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
|
||||||
expect(hook).toHaveBeenCalledOnce();
|
expect(hook).toHaveBeenCalledOnce();
|
||||||
expect(sub).toHaveBeenCalledOnce();
|
expect(sub).toHaveBeenCalledOnce();
|
||||||
});
|
});
|
||||||
@@ -215,7 +215,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
|
|||||||
throw new Error('boom');
|
throw new Error('boom');
|
||||||
});
|
});
|
||||||
expect(() =>
|
expect(() =>
|
||||||
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }),
|
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
|
||||||
).toThrow('boom');
|
).toThrow('boom');
|
||||||
});
|
});
|
||||||
});
|
});
|
||||||
|
|||||||
@@ -106,7 +106,7 @@ describe('connectNats — bridge bus → JetStream', () => {
|
|||||||
|
|
||||||
await connectNats('nats://test:4222');
|
await connectNats('nats://test:4222');
|
||||||
|
|
||||||
const payload = { userId: 'u1', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
|
const payload = { userId: 'u1', source: 'todoist', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
|
||||||
bus.publish('signals.task.synced', payload);
|
bus.publish('signals.task.synced', payload);
|
||||||
|
|
||||||
// Allow the queued microtask in the hook to flush.
|
// Allow the queued microtask in the hook to flush.
|
||||||
@@ -121,16 +121,17 @@ describe('connectNats — bridge bus → JetStream', () => {
|
|||||||
|
|
||||||
it('swallows JetStream publish errors so the in-process bus keeps working', async () => {
|
it('swallows JetStream publish errors so the in-process bus keeps working', async () => {
|
||||||
const { connectNats } = await import('../nats.js');
|
const { connectNats } = await import('../nats.js');
|
||||||
|
const { logger } = await import('../../logger.js');
|
||||||
const { bus } = await import('../bus.js');
|
const { bus } = await import('../bus.js');
|
||||||
|
|
||||||
await connectNats('nats://test:4222');
|
await connectNats('nats://test:4222');
|
||||||
|
|
||||||
// Force the next js.publish to reject.
|
// Force the next js.publish to reject.
|
||||||
lastJsPublish.mockRejectedValueOnce(new Error('jetstream down'));
|
lastJsPublish.mockRejectedValueOnce(new Error('jetstream down'));
|
||||||
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
|
const errSpy = vi.spyOn(logger, 'error');
|
||||||
|
|
||||||
expect(() =>
|
expect(() =>
|
||||||
bus.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }),
|
bus.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
|
||||||
).not.toThrow();
|
).not.toThrow();
|
||||||
|
|
||||||
// Wait a tick for the rejected promise's catch to run.
|
// Wait a tick for the rejected promise's catch to run.
|
||||||
@@ -142,12 +143,16 @@ describe('connectNats — bridge bus → JetStream', () => {
|
|||||||
describe('connectNats — failure mode', () => {
|
describe('connectNats — failure mode', () => {
|
||||||
it('logs a warning and stays silent when connect rejects', async () => {
|
it('logs a warning and stays silent when connect rejects', async () => {
|
||||||
const { connectNats } = await import('../nats.js');
|
const { connectNats } = await import('../nats.js');
|
||||||
|
const { logger } = await import('../../logger.js');
|
||||||
|
|
||||||
lastConnect.mockRejectedValueOnce(new Error('ECONNREFUSED'));
|
lastConnect.mockRejectedValueOnce(new Error('ECONNREFUSED'));
|
||||||
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});
|
const warnSpy = vi.spyOn(logger, 'warn');
|
||||||
|
|
||||||
await expect(connectNats('nats://nope:4222')).resolves.toBeUndefined();
|
await expect(connectNats('nats://nope:4222')).resolves.toBeUndefined();
|
||||||
expect(warnSpy).toHaveBeenCalledWith(expect.stringContaining('connection failed'));
|
expect(warnSpy).toHaveBeenCalledWith(
|
||||||
|
expect.objectContaining({ err: expect.anything() }),
|
||||||
|
expect.stringContaining('connection failed'),
|
||||||
|
);
|
||||||
});
|
});
|
||||||
});
|
});
|
||||||
|
|
||||||
@@ -156,7 +161,7 @@ describe('Bus.onPublish contract — used by NATS bridge', () => {
|
|||||||
const b = new Bus();
|
const b = new Bus();
|
||||||
const hook = vi.fn();
|
const hook = vi.fn();
|
||||||
b.onPublish(hook);
|
b.onPublish(hook);
|
||||||
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' });
|
b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
|
||||||
expect(hook).toHaveBeenCalledOnce();
|
expect(hook).toHaveBeenCalledOnce();
|
||||||
});
|
});
|
||||||
});
|
});
|
||||||
|
|||||||
@@ -45,6 +45,7 @@ export type RewardDeliveryFailedEvent = {
|
|||||||
|
|
||||||
export type TaskSyncedEvent = {
|
export type TaskSyncedEvent = {
|
||||||
userId: string;
|
userId: string;
|
||||||
|
source: string; // e.g. 'todoist'
|
||||||
count: number;
|
count: number;
|
||||||
syncedAt: string;
|
syncedAt: string;
|
||||||
};
|
};
|
||||||
|
|||||||
@@ -12,6 +12,7 @@
|
|||||||
|
|
||||||
import type { NatsConnection, JetStreamClient, StreamConfig } from 'nats';
|
import type { NatsConnection, JetStreamClient, StreamConfig } from 'nats';
|
||||||
import { bus } from './bus.js';
|
import { bus } from './bus.js';
|
||||||
|
import { logger } from '../logger.js';
|
||||||
|
|
||||||
let nc: NatsConnection | null = null;
|
let nc: NatsConnection | null = null;
|
||||||
let js: JetStreamClient | null = null;
|
let js: JetStreamClient | null = null;
|
||||||
@@ -67,13 +68,13 @@ export async function connectNats(natsUrl: string): Promise<void> {
|
|||||||
if (!js) return;
|
if (!js) return;
|
||||||
const data = new TextEncoder().encode(JSON.stringify(payload));
|
const data = new TextEncoder().encode(JSON.stringify(payload));
|
||||||
js.publish(subject, data).catch((err: Error) =>
|
js.publish(subject, data).catch((err: Error) =>
|
||||||
console.error(`[nats] publish failed for ${subject}: ${err.message}`),
|
logger.error({ err, subject }, 'nats publish failed'),
|
||||||
);
|
);
|
||||||
});
|
});
|
||||||
|
|
||||||
console.log(`[nats] connected to ${natsUrl}, streams: ${STREAMS.map((s) => s.name).join(', ')}`);
|
logger.info({ url: natsUrl, streams: STREAMS.map((s) => s.name) }, 'nats connected');
|
||||||
} catch (err: any) {
|
} catch (err: any) {
|
||||||
console.warn(`[nats] connection failed — running without JetStream: ${err.message}`);
|
logger.warn({ err }, 'nats connection failed — running without JetStream');
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -1,7 +1,10 @@
|
|||||||
import 'dotenv/config';
|
import 'dotenv/config';
|
||||||
|
import { logger } from './logger.js';
|
||||||
import express from 'express';
|
import express from 'express';
|
||||||
|
import { pinoHttp } from 'pino-http';
|
||||||
import cookieParser from 'cookie-parser';
|
import cookieParser from 'cookie-parser';
|
||||||
import cors from 'cors';
|
import cors from 'cors';
|
||||||
|
import { tracingMiddleware } from './middleware/tracing.js';
|
||||||
import { config } from './config.js';
|
import { config } from './config.js';
|
||||||
import { db, runMigrations } from './db/index.js';
|
import { db, runMigrations } from './db/index.js';
|
||||||
import { tipScores, tipFeedback } from './db/schema.js';
|
import { tipScores, tipFeedback } from './db/schema.js';
|
||||||
@@ -12,7 +15,7 @@ import { integrationsRouter } from './routes/integrations.js';
|
|||||||
import { recommenderRouter } from './routes/recommender.js';
|
import { recommenderRouter } from './routes/recommender.js';
|
||||||
import { userRouter } from './routes/user.js';
|
import { userRouter } from './routes/user.js';
|
||||||
import { pushRouter } from './routes/push.js';
|
import { pushRouter } from './routes/push.js';
|
||||||
import { adminRouter } from './routes/admin.js';
|
import { adminRouter, adminInternalRouter } from './routes/admin.js';
|
||||||
import { mkdir } from 'fs/promises';
|
import { mkdir } from 'fs/promises';
|
||||||
import { dirname } from 'path';
|
import { dirname } from 'path';
|
||||||
import { requireAuth } from './middleware/session.js';
|
import { requireAuth } from './middleware/session.js';
|
||||||
@@ -26,13 +29,11 @@ import { registerProfileSubscriptions } from './profile/subscriber.js';
|
|||||||
await mkdir(dirname(config.DATABASE_PATH), { recursive: true });
|
await mkdir(dirname(config.DATABASE_PATH), { recursive: true });
|
||||||
runMigrations();
|
runMigrations();
|
||||||
|
|
||||||
// Keep the API alive on stray async faults (e.g. a single bad admin route)
|
|
||||||
// rather than dropping the whole process.
|
|
||||||
process.on('unhandledRejection', (reason) => {
|
process.on('unhandledRejection', (reason) => {
|
||||||
console.error('[api] unhandledRejection', reason);
|
logger.error({ err: reason }, 'unhandledRejection');
|
||||||
});
|
});
|
||||||
process.on('uncaughtException', (err) => {
|
process.on('uncaughtException', (err) => {
|
||||||
console.error('[api] uncaughtException', err);
|
logger.fatal({ err }, 'uncaughtException');
|
||||||
});
|
});
|
||||||
|
|
||||||
const app = express();
|
const app = express();
|
||||||
@@ -43,6 +44,15 @@ app.use(
|
|||||||
credentials: true,
|
credentials: true,
|
||||||
}),
|
}),
|
||||||
);
|
);
|
||||||
|
app.use(tracingMiddleware);
|
||||||
|
app.use(
|
||||||
|
pinoHttp({
|
||||||
|
logger,
|
||||||
|
genReqId: (req) => req.traceId,
|
||||||
|
customProps: (req) => ({ traceId: req.traceId }),
|
||||||
|
autoLogging: { ignore: (req) => req.url === '/health' },
|
||||||
|
}),
|
||||||
|
);
|
||||||
app.use(express.json());
|
app.use(express.json());
|
||||||
app.use(cookieParser());
|
app.use(cookieParser());
|
||||||
app.use(sessionMiddleware);
|
app.use(sessionMiddleware);
|
||||||
@@ -55,17 +65,15 @@ app.use('/api', recommenderRouter);
|
|||||||
app.use('/api/user', userRouter);
|
app.use('/api/user', userRouter);
|
||||||
app.use('/api/push', pushRouter);
|
app.use('/api/push', pushRouter);
|
||||||
app.use('/api/admin', adminRouter);
|
app.use('/api/admin', adminRouter);
|
||||||
|
app.use('/api/admin', adminInternalRouter);
|
||||||
|
|
||||||
// Proxy ml/serving endpoints through the API (admin-only).
|
|
||||||
// Allows admin UI to call /api/ml/stats/:userId, /api/ml/features/:userId
|
|
||||||
// without needing direct access to the ml/serving port.
|
|
||||||
app.use('/api/ml', requireAuth as any, requireAdmin as any, async (req: Request, res: Response) => {
|
app.use('/api/ml', requireAuth as any, requireAdmin as any, async (req: Request, res: Response) => {
|
||||||
const mlUrl = config.ML_SERVING_URL;
|
const mlUrl = config.ML_SERVING_URL;
|
||||||
const target = `${mlUrl}${req.path}`;
|
const target = `${mlUrl}${req.path}`;
|
||||||
try {
|
try {
|
||||||
const upstream = await fetch(target, {
|
const upstream = await fetch(target, {
|
||||||
method: req.method,
|
method: req.method,
|
||||||
headers: { 'Content-Type': 'application/json' },
|
headers: { 'Content-Type': 'application/json', traceparent: req.traceparent },
|
||||||
body: req.method !== 'GET' ? JSON.stringify(req.body) : undefined,
|
body: req.method !== 'GET' ? JSON.stringify(req.body) : undefined,
|
||||||
signal: AbortSignal.timeout(5000),
|
signal: AbortSignal.timeout(5000),
|
||||||
});
|
});
|
||||||
@@ -82,7 +90,7 @@ async function purgeExpiredData() {
|
|||||||
await db.delete(tipScores).where(lt(tipScores.servedAt, cutoff));
|
await db.delete(tipScores).where(lt(tipScores.servedAt, cutoff));
|
||||||
await db.delete(tipFeedback).where(lt(tipFeedback.createdAt, cutoff));
|
await db.delete(tipFeedback).where(lt(tipFeedback.createdAt, cutoff));
|
||||||
} catch (err: any) {
|
} catch (err: any) {
|
||||||
console.error(`[purge] retention cleanup failed: ${err.message}`);
|
logger.error({ err }, 'retention cleanup failed');
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -90,7 +98,7 @@ purgeExpiredData();
|
|||||||
setInterval(purgeExpiredData, 24 * 60 * 60 * 1000);
|
setInterval(purgeExpiredData, 24 * 60 * 60 * 1000);
|
||||||
|
|
||||||
app.listen(config.PORT, () => {
|
app.listen(config.PORT, () => {
|
||||||
console.log(`oO API listening on http://localhost:${config.PORT}`);
|
logger.info({ port: config.PORT }, 'oO API listening');
|
||||||
});
|
});
|
||||||
|
|
||||||
if (config.NATS_URL) {
|
if (config.NATS_URL) {
|
||||||
|
|||||||
12
services/api/src/logger.ts
Normal file
@@ -0,0 +1,12 @@
import pino from 'pino';
import * as Sentry from '@sentry/node';

if (process.env['SENTRY_DSN']) {
  Sentry.init({
    dsn: process.env['SENTRY_DSN'],
    environment: process.env['NODE_ENV'] ?? 'development',
  });
}

export const logger = pino({ level: process.env['LOG_LEVEL'] ?? 'info' });
export { Sentry };
26
services/api/src/middleware/tracing.ts
Normal file
@@ -0,0 +1,26 @@
import { randomBytes } from 'crypto';
import type { Request, Response, NextFunction } from 'express';

declare global {
  namespace Express {
    interface Request {
      traceId: string;
      traceparent: string;
    }
  }
}

export function tracingMiddleware(req: Request, _res: Response, next: NextFunction): void {
  const incoming = req.headers['traceparent'] as string | undefined;
  let traceId: string;
  if (incoming) {
    const parts = incoming.split('-');
    traceId = parts.length === 4 && parts[1]?.length === 32 ? parts[1] : randomBytes(16).toString('hex');
  } else {
    traceId = randomBytes(16).toString('hex');
  }
  const parentId = randomBytes(8).toString('hex');
  req.traceId = traceId;
  req.traceparent = `00-${traceId}-${parentId}-01`;
  next();
}
@@ -4,7 +4,7 @@
|
|||||||
* A real Express app + in-memory SQLite DB per test suite.
|
* A real Express app + in-memory SQLite DB per test suite.
|
||||||
* Auth and admin middleware are mocked so we can focus on route logic.
|
* Auth and admin middleware are mocked so we can focus on route logic.
|
||||||
*/
|
*/
|
||||||
import { describe, it, expect, vi, beforeAll } from 'vitest';
|
import { describe, it, expect, vi, beforeAll, afterEach } from 'vitest';
|
||||||
import express from 'express';
|
import express from 'express';
|
||||||
import * as http from 'http';
|
import * as http from 'http';
|
||||||
import { makeTestDb } from '../../test/db.js';
|
import { makeTestDb } from '../../test/db.js';
|
||||||
@@ -385,16 +385,126 @@ describe('GET /api/admin/events', () => {
|
|||||||
});
|
});
|
||||||
});
|
});
|
||||||
|
|
||||||
|
// ---------------------------------------------------------------------------
|
||||||
|
// Health endpoint — mock fetch so tests don't depend on running services.
|
||||||
|
// ---------------------------------------------------------------------------
|
||||||
describe('GET /api/admin/health', () => {
|
describe('GET /api/admin/health', () => {
|
||||||
it('returns 200 with ok, services array, and checkedAt', async () => {
|
const EXPECTED_HTTP_SERVICES = ['api', 'ml-serving', 'mlflow', 'airflow'] as const;
|
||||||
|
const EXPECTED_INTERNAL = ['sqlite', 'event-bus'] as const;
|
||||||
|
const VALID_STATUSES = new Set(['ok', 'degraded', 'down']);
|
||||||
|
|
||||||
|
type ServiceRow = { name: string; status: string; latencyMs: number };
|
||||||
|
type HealthBody = { ok: boolean; services: ServiceRow[]; checkedAt: string };
|
||||||
|
|
||||||
|
function mockFetch(upServices: Set<string>) {
|
||||||
|
// Resolve service name by port (matches defaults in config.ts).
|
||||||
|
// Up services return HTTP 200; absent ones throw (simulates connection refused → 'down').
|
||||||
|
vi.stubGlobal('fetch', async (url: string) => {
|
||||||
|
const s = String(url);
|
||||||
|
let name: string;
|
||||||
|
if (s.includes(':8000')) name = 'ml-serving';
|
||||||
|
else if (s.includes(':5000')) name = 'mlflow';
|
||||||
|
else if (s.includes(':8080')) name = 'airflow';
|
||||||
|
else name = 'api';
|
||||||
|
|
||||||
|
if (!upServices.has(name)) throw new Error(`ECONNREFUSED ${name}`);
|
||||||
|
return { ok: true, json: async () => ({ ok: true, status: 'healthy' }) };
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
afterEach(() => vi.unstubAllGlobals());
|
||||||
|
|
||||||
|
it('shape: 200, typed fields, all expected services present', async () => {
|
||||||
|
mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
|
||||||
const { server, call } = await startServer(buildApp());
|
const { server, call } = await startServer(buildApp());
|
||||||
try {
|
try {
|
||||||
const { status, body } = await call('GET', '/api/admin/health');
|
const { status, body } = await call('GET', '/api/admin/health');
|
||||||
const b = body as { ok: boolean; services: { name: string; status: string }[]; checkedAt: string };
|
const b = body as HealthBody;
|
||||||
expect(status).toBe(200);
|
expect(status).toBe(200);
|
||||||
expect(typeof b.ok).toBe('boolean');
|
expect(typeof b.ok).toBe('boolean');
|
||||||
expect(Array.isArray(b.services)).toBe(true);
|
expect(Array.isArray(b.services)).toBe(true);
|
||||||
expect(typeof b.checkedAt).toBe('string');
|
expect(typeof b.checkedAt).toBe('string');
|
||||||
|
expect(new Date(b.checkedAt).getTime()).toBeGreaterThan(0);
|
||||||
|
|
||||||
|
const names = b.services.map((s) => s.name);
|
||||||
|
for (const svc of [...EXPECTED_HTTP_SERVICES, ...EXPECTED_INTERNAL]) {
|
||||||
|
expect(names).toContain(svc);
|
||||||
|
}
|
||||||
|
for (const svc of b.services) {
|
||||||
|
expect(VALID_STATUSES).toContain(svc.status);
|
||||||
|
expect(typeof svc.latencyMs).toBe('number');
|
||||||
|
}
|
||||||
|
} finally {
|
||||||
|
server.close();
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
it('ok=true when all HTTP services respond 200', async () => {
|
||||||
|
mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
|
||||||
|
const { server, call } = await startServer(buildApp());
|
||||||
|
try {
|
||||||
|
const { body } = await call('GET', '/api/admin/health');
|
||||||
|
const b = body as HealthBody;
|
||||||
|
for (const name of EXPECTED_HTTP_SERVICES) {
|
||||||
|
const svc = b.services.find((s) => s.name === name);
|
||||||
|
expect(svc?.status, `${name} should be ok`).toBe('ok');
|
||||||
|
}
|
||||||
|
expect(b.ok).toBe(true);
|
||||||
|
} finally {
|
||||||
|
server.close();
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
it('ml-serving=down and ok=false when ml-serving is unreachable', async () => {
|
||||||
|
mockFetch(new Set(['api', 'mlflow', 'airflow'])); // ml-serving absent
|
||||||
|
const { server, call } = await startServer(buildApp());
|
||||||
|
try {
|
||||||
|
const { body } = await call('GET', '/api/admin/health');
|
||||||
|
const b = body as HealthBody;
|
||||||
|
const mlSvc = b.services.find((s) => s.name === 'ml-serving');
|
||||||
|
expect(mlSvc?.status).toBe('down');
|
||||||
|
expect(b.ok).toBe(false);
|
||||||
|
} finally {
|
||||||
|
server.close();
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
it('airflow=down and ok=false when airflow is unreachable', async () => {
|
||||||
|
mockFetch(new Set(['api', 'ml-serving', 'mlflow'])); // airflow absent
|
||||||
|
const { server, call } = await startServer(buildApp());
|
||||||
|
try {
|
||||||
|
const { body } = await call('GET', '/api/admin/health');
|
||||||
|
const b = body as HealthBody;
|
||||||
|
const svc = b.services.find((s) => s.name === 'airflow');
|
||||||
|
expect(svc?.status).toBe('down');
|
||||||
|
expect(b.ok).toBe(false);
|
||||||
|
} finally {
|
||||||
|
server.close();
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
it('mlflow=down and ok=false when mlflow is unreachable', async () => {
|
||||||
|
mockFetch(new Set(['api', 'ml-serving', 'airflow'])); // mlflow absent
|
||||||
|
const { server, call } = await startServer(buildApp());
|
||||||
|
try {
|
||||||
|
const { body } = await call('GET', '/api/admin/health');
|
||||||
|
const b = body as HealthBody;
|
||||||
|
const svc = b.services.find((s) => s.name === 'mlflow');
|
||||||
|
expect(svc?.status).toBe('down');
|
||||||
|
expect(b.ok).toBe(false);
|
||||||
|
} finally {
|
||||||
|
server.close();
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
it('sqlite and event-bus are always present regardless of HTTP service status', async () => {
|
||||||
|
mockFetch(new Set()); // all HTTP services down
|
||||||
|
const { server, call } = await startServer(buildApp());
|
||||||
|
try {
|
||||||
|
const { body } = await call('GET', '/api/admin/health');
|
||||||
|
const b = body as HealthBody;
|
||||||
|
expect(b.services.find((s) => s.name === 'sqlite')?.status).toBe('ok');
|
||||||
|
expect(b.services.find((s) => s.name === 'event-bus')?.status).toBe('ok');
|
||||||
} finally {
|
} finally {
|
||||||
server.close();
|
server.close();
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -1,4 +1,5 @@
|
|||||||
import { type Router as ExpressRouter, Router, Response } from 'express';
|
import { type Router as ExpressRouter, Router, Response, type Request } from 'express';
|
||||||
|
import { logger } from '../logger.js';
|
||||||
import { db, rawSqlite } from '../db/index.js';
|
import { db, rawSqlite } from '../db/index.js';
|
||||||
import {
|
import {
|
||||||
users,
|
users,
|
||||||
@@ -523,16 +524,24 @@ router.get('/data-quality', async (req: AuthenticatedRequest, res: Response) =>
|
|||||||
// Fan-out to all subsystem /health endpoints.
|
// Fan-out to all subsystem /health endpoints.
|
||||||
// ---------------------------------------------------------------------------
|
// ---------------------------------------------------------------------------
|
||||||
router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
|
router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
|
||||||
const checks: Array<{ name: string; url: string }> = [
|
const airflowAuth = Buffer.from(`${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`).toString('base64');
|
||||||
{ name: 'api', url: `http://localhost:${process.env.PORT ?? 3001}/health` },
|
|
||||||
|
const checks: Array<{ name: string; url: string; headers?: Record<string, string> }> = [
|
||||||
|
{ name: 'api', url: `http://localhost:${config.PORT}/health` },
|
||||||
{ name: 'ml-serving', url: `${config.ML_SERVING_URL}/health` },
|
{ name: 'ml-serving', url: `${config.ML_SERVING_URL}/health` },
|
||||||
|
{ name: 'mlflow', url: `${config.MLFLOW_URL}/health` },
|
||||||
|
{ name: 'airflow', url: `${config.AIRFLOW_URL}/api/v1/health`,
|
||||||
|
headers: { Authorization: `Basic ${airflowAuth}` } },
|
||||||
];
|
];
|
||||||
|
|
||||||
const results = await Promise.allSettled(
|
const results = await Promise.allSettled(
|
||||||
checks.map(async ({ name, url }) => {
|
checks.map(async ({ name, url, headers }) => {
|
||||||
const t0 = Date.now();
|
const t0 = Date.now();
|
||||||
try {
|
try {
|
||||||
const r = await fetch(url, { signal: AbortSignal.timeout(3000) });
|
const r = await fetch(url, {
|
||||||
|
headers,
|
||||||
|
signal: AbortSignal.timeout(3000),
|
||||||
|
});
|
||||||
return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
|
return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
|
||||||
} catch {
|
} catch {
|
||||||
return { name, status: 'down', latencyMs: Date.now() - t0 };
|
return { name, status: 'down', latencyMs: Date.now() - t0 };
|
||||||
@@ -548,15 +557,12 @@ router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
|
|||||||
dbStatus = 'down';
|
dbStatus = 'down';
|
||||||
}
|
}
|
||||||
|
|
||||||
// Event bus: always ok if process is alive
|
|
||||||
const eventBusStatus = 'ok';
|
|
||||||
|
|
||||||
const services = results.map((r) =>
|
const services = results.map((r) =>
|
||||||
r.status === 'fulfilled' ? r.value : { name: 'unknown', status: 'down', latencyMs: 0 },
|
r.status === 'fulfilled' ? r.value : { name: 'unknown', status: 'down', latencyMs: 0 },
|
||||||
);
|
);
|
||||||
|
|
||||||
services.push({ name: 'sqlite', status: dbStatus, latencyMs: 0 });
|
services.push({ name: 'sqlite', status: dbStatus, latencyMs: 0 });
|
||||||
services.push({ name: 'event-bus', status: eventBusStatus, latencyMs: 0 });
|
services.push({ name: 'event-bus', status: 'ok', latencyMs: 0 });
|
||||||
|
|
||||||
const allOk = services.every((s) => s.status === 'ok');
|
const allOk = services.every((s) => s.status === 'ok');
|
||||||
res.json({ ok: allOk, services, checkedAt: new Date().toISOString() });
|
res.json({ ok: allOk, services, checkedAt: new Date().toISOString() });
|
||||||
@@ -699,22 +705,21 @@ router.delete('/saved-queries/:id', async (req: AuthenticatedRequest, res: Respo
|
|||||||
|
|
||||||
// ---------------------------------------------------------------------------
|
// ---------------------------------------------------------------------------
|
||||||
// POST /api/admin/simulate/start
|
// POST /api/admin/simulate/start
|
||||||
// Spawn ml/experiments/sim/runner.py in the background; return run_id.
|
// Trigger an Airflow DAG run (bandit_sim). Falls back to a local subprocess
|
||||||
|
// when AIRFLOW_URL is not reachable, so local dev still works.
|
||||||
// ---------------------------------------------------------------------------
|
// ---------------------------------------------------------------------------
|
||||||
router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response) => {
|
router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response) => {
|
||||||
const {
|
const {
|
||||||
nUsers = 5,
|
nUsers = 5,
|
||||||
nRounds = 20,
|
nRounds = 20,
|
||||||
tasksPerRound = 8,
|
tasksPerRound = 8,
|
||||||
useLlm = false,
|
|
||||||
judgeMode = 'rule',
|
judgeMode = 'rule',
|
||||||
policies = ['linucb-v1', 'egreedy-v1'],
|
policies = ['linucb-v1', 'egreedy-v1'],
|
||||||
} = req.body as {
|
} = req.body as {
|
||||||
nUsers?: number;
|
nUsers?: number;
|
||||||
nRounds?: number;
|
nRounds?: number;
|
||||||
tasksPerRound?: number;
|
tasksPerRound?: number;
|
||||||
useLlm?: boolean;
|
judgeMode?: 'rule' | 'llm';
|
||||||
judgeMode?: 'rule' | 'llm' | 'claude-code';
|
|
||||||
policies?: string[];
|
policies?: string[];
|
||||||
};
|
};
|
||||||
|
|
||||||
@@ -733,17 +738,69 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
|
|||||||
nUsers,
|
nUsers,
|
||||||
nRounds,
|
nRounds,
|
||||||
tasksPerRound,
|
tasksPerRound,
|
||||||
useLlm,
|
useLlm: judgeMode === 'llm',
|
||||||
|
judgeMode,
|
||||||
|
nPolicies: policies.length,
|
||||||
status: 'running',
|
status: 'running',
|
||||||
createdAt: now,
|
createdAt: now,
|
||||||
});
|
});
|
||||||
|
|
||||||
|
// ── Try Airflow first ────────────────────────────────────────────────────
|
||||||
|
if (config.AIRFLOW_URL && config.INTERNAL_API_TOKEN) {
|
||||||
|
try {
|
||||||
|
const airflowAuth = Buffer.from(
|
||||||
|
`${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`,
|
||||||
|
).toString('base64');
|
||||||
|
|
||||||
|
const dagRes = await fetch(
|
||||||
|
`${config.AIRFLOW_URL}/api/v1/dags/bandit_sim/dagRuns`,
|
||||||
|
{
|
||||||
|
method: 'POST',
|
||||||
|
headers: {
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
Authorization: `Basic ${airflowAuth}`,
|
||||||
|
},
|
||||||
|
body: JSON.stringify({
|
||||||
|
conf: {
|
||||||
|
sim_run_id: id,
|
||||||
|
n_users: nUsers,
|
||||||
|
n_rounds: nRounds,
|
||||||
|
tasks_per_round: tasksPerRound,
|
||||||
|
policies,
|
||||||
|
judge_mode: judgeMode,
|
||||||
|
ml_url: config.ML_SERVING_URL,
|
||||||
|
mlflow_url: config.MLFLOW_URL,
|
||||||
|
callback_url: `${config.API_BASE_URL}/api/admin/simulate/${id}/complete`,
|
||||||
|
internal_token: config.INTERNAL_API_TOKEN,
|
||||||
|
},
|
||||||
|
}),
|
||||||
|
signal: AbortSignal.timeout(5000),
|
||||||
|
},
|
||||||
|
);
|
||||||
|
|
||||||
|
if (dagRes.ok) {
|
||||||
|
const dagBody = await dagRes.json() as { dag_run_id: string };
|
||||||
|
await db
|
||||||
|
.update(simRuns)
|
||||||
|
.set({ airflowDagRunId: dagBody.dag_run_id })
|
||||||
|
.where(eq(simRuns.id, id));
|
||||||
|
|
||||||
|
res.json({ id, status: 'running', airflow_dag_run_id: dagBody.dag_run_id });
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
logger.warn({ status: dagRes.status }, 'sim: Airflow trigger failed, falling back to subprocess');
|
||||||
|
} catch (err) {
|
||||||
|
logger.warn({ err }, 'sim: Airflow unreachable, falling back to subprocess');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Subprocess fallback (local dev / Airflow not configured) ────────────
|
||||||
const runnerPath = resolve(__dirname, '../../../../ml/experiments/sim/runner.py');
|
const runnerPath = resolve(__dirname, '../../../../ml/experiments/sim/runner.py');
|
||||||
const venvPython = resolve(__dirname, '../../../../ml/serving/.venv/bin/python');
|
const venvPython = resolve(__dirname, '../../../../ml/serving/.venv/bin/python');
|
||||||
const pythonBin = existsSync(venvPython) ? venvPython : 'python3';
|
const pythonBin = existsSync(venvPython) ? venvPython : 'python3';
|
||||||
const outPath = `/tmp/oo-sim-${id}.json`;
|
const outPath = `/tmp/oo-sim-${id}.json`;
|
||||||
|
|
||||||
const args = [
|
const child = spawn(pythonBin, [
|
||||||
runnerPath,
|
runnerPath,
|
||||||
'--n-users', String(nUsers),
|
'--n-users', String(nUsers),
|
||||||
'--n-rounds', String(nRounds),
|
'--n-rounds', String(nRounds),
|
||||||
@@ -751,32 +808,22 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
|
|||||||
'--ml-url', config.ML_SERVING_URL,
|
'--ml-url', config.ML_SERVING_URL,
|
||||||
'--policies', ...policies,
|
'--policies', ...policies,
|
||||||
'--out', outPath,
|
'--out', outPath,
|
||||||
'--judge', judgeMode === 'llm' ? 'llm' : judgeMode === 'claude-code' ? 'rule' : 'rule',
|
'--judge', judgeMode,
|
||||||
// claude-code mode isn't auto-runnable from the API (requires human in the loop)
|
'--mlflow-url', config.MLFLOW_URL,
|
||||||
// it falls back to rule judge when triggered from the panel
|
'--mlflow-experiment', 'bandit_simulation',
|
||||||
];
|
], { stdio: ['ignore', 'pipe', 'pipe'] });
|
||||||
|
|
||||||
const child = spawn(pythonBin, args, { stdio: ['ignore', 'pipe', 'pipe'] });
|
if (child.pid) _simProcesses.set(id, { pid: child.pid, startedAt: now });
|
||||||
|
|
||||||
if (child.pid) {
|
|
||||||
_simProcesses.set(id, { pid: child.pid, startedAt: now });
|
|
||||||
}
|
|
||||||
|
|
||||||
// Without this listener, a spawn failure (ENOENT when python3 is absent
|
|
||||||
// — e.g. in the alpine api container) would emit an unhandled 'error' event
|
|
||||||
// and crash the whole API process.
|
|
||||||
child.on('error', async (err) => {
|
child.on('error', async (err) => {
|
||||||
console.error('[sim] spawn error', err);
|
logger.error({ err }, 'sim: spawn error');
|
||||||
_simProcesses.delete(id);
|
_simProcesses.delete(id);
|
||||||
await db
|
await db.update(simRuns)
|
||||||
.update(simRuns)
|
|
||||||
.set({ status: 'failed', finishedAt: new Date().toISOString() })
|
.set({ status: 'failed', finishedAt: new Date().toISOString() })
|
||||||
.where(eq(simRuns.id, id));
|
.where(eq(simRuns.id, id));
|
||||||
});
|
});
|
||||||
|
|
||||||
// Capture stderr for debugging
|
child.stderr?.on('data', (d: Buffer) => logger.debug({ stderr: d.toString() }, 'sim stderr'));
|
||||||
const stderrLines: string[] = [];
|
|
||||||
child.stderr?.on('data', (d: Buffer) => stderrLines.push(d.toString()));
|
|
||||||
|
|
||||||
child.on('exit', async (code) => {
|
child.on('exit', async (code) => {
|
||||||
_simProcesses.delete(id);
|
_simProcesses.delete(id);
|
||||||
@@ -785,8 +832,6 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
|
|||||||
if (code === 0 && existsSync(outPath)) {
|
if (code === 0 && existsSync(outPath)) {
|
||||||
try {
|
try {
|
||||||
const raw = JSON.parse(readFileSync(outPath, 'utf-8'));
|
const raw = JSON.parse(readFileSync(outPath, 'utf-8'));
|
||||||
|
|
||||||
// Bulk-insert sim events
|
|
||||||
const eventRows = (raw.events ?? []).map((ev: Record<string, unknown>) => ({
|
const eventRows = (raw.events ?? []).map((ev: Record<string, unknown>) => ({
|
||||||
id: nanoid(),
|
id: nanoid(),
|
||||||
runId: id,
|
runId: id,
|
||||||
@@ -804,21 +849,19 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
|
|||||||
dayOfWeek: Number(ev.day_of_week),
|
dayOfWeek: Number(ev.day_of_week),
|
||||||
createdAt: now,
|
createdAt: now,
|
||||||
}));
|
}));
|
||||||
|
|
||||||
for (const row of eventRows) {
|
for (const row of eventRows) {
|
||||||
await db.insert(simEvents).values(row).catch(() => {});
|
await db.insert(simEvents).values(row).catch(() => {});
|
||||||
}
|
}
|
||||||
|
|
||||||
await db.update(simRuns).set({
|
await db.update(simRuns).set({
|
||||||
status: 'done',
|
status: 'done',
|
||||||
summaryJson: JSON.stringify(raw.summary),
|
summaryJson: JSON.stringify(raw.summary),
|
||||||
winner: raw.winner,
|
winner: raw.winner,
|
||||||
personaBreakdownJson: JSON.stringify(raw.persona_breakdown),
|
personaBreakdownJson: JSON.stringify(raw.persona_breakdown),
|
||||||
|
mlflowRunId: raw.mlflow_run_id ?? null,
|
||||||
finishedAt,
|
finishedAt,
|
||||||
}).where(eq(simRuns.id, id));
|
}).where(eq(simRuns.id, id));
|
||||||
|
|
||||||
try { unlinkSync(outPath); } catch { /* ignore */ }
|
try { unlinkSync(outPath); } catch { /* ignore */ }
|
||||||
} catch (e) {
|
} catch {
|
||||||
await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
|
await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
|
||||||
}
|
}
|
||||||
} else {
|
} else {
|
||||||
@@ -863,4 +906,68 @@ router.get('/simulate/:id', async (req: AuthenticatedRequest, res: Response) =>
|
|||||||
res.json({ run: { ...run, isRunning }, events });
|
res.json({ run: { ...run, isRunning }, events });
|
||||||
});
|
});
|
||||||
|
|
||||||
export { router as adminRouter };
|
// ---------------------------------------------------------------------------
|
||||||
|
// internalRouter — no session auth; only INTERNAL_API_TOKEN header check.
|
||||||
|
// Mounted separately in index.ts at /api/admin to avoid router.use() auth.
|
||||||
|
// ---------------------------------------------------------------------------
|
||||||
|
const internalRouter: ExpressRouter = Router();
|
||||||
|
|
||||||
|
internalRouter.post('/simulate/:id/complete', async (req: Request, res: Response) => {
|
||||||
|
const token = req.headers['x-internal-token'];
|
||||||
|
if (!config.INTERNAL_API_TOKEN || token !== config.INTERNAL_API_TOKEN) {
|
||||||
|
res.status(401).json({ error: 'Unauthorized' });
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const { id } = req.params as { id: string };
|
||||||
|
const { summary, winner, persona_breakdown, events: rawEvents, mlflow_run_id } =
|
||||||
|
req.body as {
|
||||||
|
summary: Record<string, unknown>;
|
||||||
|
winner: string;
|
||||||
|
persona_breakdown: Record<string, unknown>;
|
||||||
|
events: Record<string, unknown>[];
|
||||||
|
mlflow_run_id?: string;
|
||||||
|
};
|
||||||
|
|
||||||
|
const finishedAt = new Date().toISOString();
|
||||||
|
const now = finishedAt;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const eventRows = (rawEvents ?? []).map((ev) => ({
|
||||||
|
id: nanoid(),
|
||||||
|
runId: id,
|
||||||
|
round: Number(ev['round']),
|
||||||
|
userId: String(ev['user_id']),
|
||||||
|
persona: String(ev['persona']),
|
||||||
|
policy: String(ev['policy']),
|
||||||
|
tipContent: String(ev['tip_content']),
|
||||||
|
priority: Number(ev['priority']),
|
||||||
|
isOverdue: Boolean(ev['is_overdue']),
|
||||||
|
action: String(ev['action']),
|
||||||
|
dwellMs: ev['dwell_ms'] != null ? Number(ev['dwell_ms']) : null,
|
||||||
|
rewardMilli: Math.round(Number(ev['reward']) * 1000),
|
||||||
|
hour: Number(ev['hour']),
|
||||||
|
dayOfWeek: Number(ev['day_of_week']),
|
||||||
|
createdAt: now,
|
||||||
|
}));
|
||||||
|
for (const row of eventRows) {
|
||||||
|
await db.insert(simEvents).values(row).catch(() => {});
|
||||||
|
}
|
||||||
|
await db.update(simRuns).set({
|
||||||
|
status: 'done',
|
||||||
|
summaryJson: JSON.stringify(summary),
|
||||||
|
winner,
|
||||||
|
personaBreakdownJson: JSON.stringify(persona_breakdown),
|
||||||
|
mlflowRunId: mlflow_run_id ?? null,
|
||||||
|
finishedAt,
|
||||||
|
}).where(eq(simRuns.id, id));
|
||||||
|
|
||||||
|
res.json({ ok: true });
|
||||||
|
} catch (err) {
|
||||||
|
logger.error({ err }, 'sim: complete callback failed');
|
||||||
|
await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
|
||||||
|
res.status(500).json({ error: 'Failed to store results' });
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
export { router as adminRouter, internalRouter as adminInternalRouter };
|
||||||
|
|||||||
@@ -5,6 +5,7 @@ import { db } from '../db/index.js';
|
|||||||
import { users, sessions } from '../db/schema.js';
|
import { users, sessions } from '../db/schema.js';
|
||||||
import { eq } from 'drizzle-orm';
|
import { eq } from 'drizzle-orm';
|
||||||
import { config } from '../config.js';
|
import { config } from '../config.js';
|
||||||
|
import { logger } from '../logger.js';
|
||||||
|
|
||||||
const router: ExpressRouter = Router();
|
const router: ExpressRouter = Router();
|
||||||
|
|
||||||
@@ -36,7 +37,7 @@ router.get('/login', async (req: Request, res: Response) => {
|
|||||||
setTimeout(() => pendingStates.delete(state), 10 * 60 * 1000);
|
setTimeout(() => pendingStates.delete(state), 10 * 60 * 1000);
|
||||||
|
|
||||||
const redirectUri = `${config.API_BASE_URL}/api/auth/callback`;
|
const redirectUri = `${config.API_BASE_URL}/api/auth/callback`;
|
||||||
console.log('[auth] redirect_uri sent to Google:', redirectUri);
|
logger.info({ redirectUri }, 'auth: redirect_uri');
|
||||||
const authUrl = client.buildAuthorizationUrl(cfg, {
|
const authUrl = client.buildAuthorizationUrl(cfg, {
|
||||||
redirect_uri: redirectUri,
|
redirect_uri: redirectUri,
|
||||||
scope: 'openid email profile',
|
scope: 'openid email profile',
|
||||||
@@ -72,7 +73,7 @@ router.get('/callback', async (req: Request, res: Response) => {
|
|||||||
expectedState: state,
|
expectedState: state,
|
||||||
});
|
});
|
||||||
} catch (err) {
|
} catch (err) {
|
||||||
console.error('OAuth callback error', err);
|
logger.error({ err }, 'auth: OAuth callback error');
|
||||||
res.status(400).json({ error: 'OAuth error' });
|
res.status(400).json({ error: 'OAuth error' });
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
@@ -123,6 +124,45 @@ router.get('/callback', async (req: Request, res: Response) => {
|
|||||||
.redirect(`${config.WEB_BASE_URL}${pending.redirectTo}`);
|
.redirect(`${config.WEB_BASE_URL}${pending.redirectTo}`);
|
||||||
});
|
});
|
||||||
|
|
||||||
|
/**
|
||||||
|
* POST /api/auth/token
|
||||||
|
* Exchange the static ADMIN_TOKEN for a session cookie.
|
||||||
|
* Finds the first admin user in the DB; rejects if ADMIN_TOKEN is not configured.
|
||||||
|
*/
|
||||||
|
router.post('/token', async (req: Request, res: Response) => {
|
||||||
|
const { token } = req.body as { token?: string };
|
||||||
|
if (!config.ADMIN_TOKEN || !token || token !== config.ADMIN_TOKEN) {
|
||||||
|
res.status(401).json({ error: 'Invalid token' });
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const [adminUser] = await db
|
||||||
|
.select()
|
||||||
|
.from(users)
|
||||||
|
.where(eq(users.role, 'admin'))
|
||||||
|
.limit(1);
|
||||||
|
|
||||||
|
if (!adminUser) {
|
||||||
|
res.status(403).json({ error: 'No admin user exists' });
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const sid = nanoid(32);
|
||||||
|
const now = new Date().toISOString();
|
||||||
|
const expiresAt = new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString();
|
||||||
|
await db.insert(sessions).values({ id: sid, userId: adminUser.id, expiresAt, createdAt: now });
|
||||||
|
|
||||||
|
res
|
||||||
|
.cookie('sid', sid, {
|
||||||
|
httpOnly: true,
|
||||||
|
secure: config.NODE_ENV === 'production',
|
||||||
|
sameSite: 'lax',
|
||||||
|
expires: new Date(expiresAt),
|
||||||
|
path: '/',
|
||||||
|
})
|
||||||
|
.json({ ok: true });
|
||||||
|
});
|
||||||
|
|
||||||
/** POST /api/auth/logout */
|
/** POST /api/auth/logout */
|
||||||
router.post('/logout', async (req: Request, res: Response) => {
|
router.post('/logout', async (req: Request, res: Response) => {
|
||||||
const sid = req.cookies?.sid as string | undefined;
|
const sid = req.cookies?.sid as string | undefined;
|
||||||
|
|||||||
@@ -1,5 +1,6 @@
|
|||||||
import { type Router as ExpressRouter, Router, Response } from 'express';
|
import { type Router as ExpressRouter, Router, Response } from 'express';
|
||||||
import { nanoid } from 'nanoid';
|
import { nanoid } from 'nanoid';
|
||||||
|
import { logger } from '../logger.js';
|
||||||
import { db } from '../db/index.js';
|
import { db } from '../db/index.js';
|
||||||
import { integrationTokens, tipFeedback, tipViews, tipScores } from '../db/schema.js';
|
import { integrationTokens, tipFeedback, tipViews, tipScores } from '../db/schema.js';
|
||||||
import { eq, and, desc } from 'drizzle-orm';
|
import { eq, and, desc } from 'drizzle-orm';
|
||||||
@@ -47,7 +48,8 @@ export const _clearCandidateCacheForTests = () => {
|
|||||||
// Shadow-policy registry
|
// Shadow-policy registry
|
||||||
// ---------------------------------------------------------------------------
|
// ---------------------------------------------------------------------------
|
||||||
const shadowPolicies = new Map<string, { active: boolean }>([
|
const shadowPolicies = new Map<string, { active: boolean }>([
|
||||||
// egreedy-v2 (D=12, profile features) — disabled until sim gate per ADR-0012
|
// egreedy-v2 promoted to active policy (ADR-0012). Shadow entry kept for
|
||||||
|
// rollback toggle; leave disabled in normal operation.
|
||||||
['egreedy-v2-shadow', { active: false }],
|
['egreedy-v2-shadow', { active: false }],
|
||||||
]);
|
]);
|
||||||
|
|
||||||
@@ -84,6 +86,7 @@ async function remotePolicy(
|
|||||||
userId: string,
|
userId: string,
|
||||||
tasks: TipCandidate[],
|
tasks: TipCandidate[],
|
||||||
profile: Profile,
|
profile: Profile,
|
||||||
|
traceparent?: string,
|
||||||
): Promise<{ tipId: string; score: number; policy: string } | null> {
|
): Promise<{ tipId: string; score: number; policy: string } | null> {
|
||||||
const hour = new Date().getHours();
|
const hour = new Date().getHours();
|
||||||
const dayOfWeek = new Date().getDay();
|
const dayOfWeek = new Date().getDay();
|
||||||
@@ -101,17 +104,16 @@ async function remotePolicy(
|
|||||||
profile_features: profile,
|
profile_features: profile,
|
||||||
};
|
};
|
||||||
|
|
||||||
// Active policy: egreedy-v1 (selected over linucb-v1 after offline sim — ADR-0007)
|
|
||||||
try {
|
try {
|
||||||
const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy`, {
|
const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy/v2`, {
|
||||||
method: 'POST',
|
method: 'POST',
|
||||||
headers: { 'Content-Type': 'application/json' },
|
headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
|
||||||
body: JSON.stringify(body),
|
body: JSON.stringify(body),
|
||||||
signal: AbortSignal.timeout(3000),
|
signal: AbortSignal.timeout(3000),
|
||||||
});
|
});
|
||||||
if (!res.ok) return null;
|
if (!res.ok) return null;
|
||||||
const data = (await res.json()) as { tip_id: string; score: number };
|
const data = (await res.json()) as { tip_id: string; score: number };
|
||||||
return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v1' };
|
return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v2' };
|
||||||
} catch {
|
} catch {
|
||||||
return null;
|
return null;
|
||||||
}
|
}
|
||||||
@@ -145,6 +147,7 @@ async function fetchLlmCandidates(
|
|||||||
dayOfWeek: number,
|
dayOfWeek: number,
|
||||||
promptVersion: string | null,
|
promptVersion: string | null,
|
||||||
profile: Profile,
|
profile: Profile,
|
||||||
|
traceparent?: string,
|
||||||
): Promise<LlmGenerateResult> {
|
): Promise<LlmGenerateResult> {
|
||||||
try {
|
try {
|
||||||
const tasks = signals.slice(0, 10).map((s) => ({
|
const tasks = signals.slice(0, 10).map((s) => ({
|
||||||
@@ -155,7 +158,7 @@ async function fetchLlmCandidates(
|
|||||||
}));
|
}));
|
||||||
const res = await fetch(`${config.ML_SERVING_URL}/generate`, {
|
const res = await fetch(`${config.ML_SERVING_URL}/generate`, {
|
||||||
method: 'POST',
|
method: 'POST',
|
||||||
headers: { 'Content-Type': 'application/json' },
|
headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
|
||||||
body: JSON.stringify({
|
body: JSON.stringify({
|
||||||
user_id: userId,
|
user_id: userId,
|
||||||
context: { tasks, hour_of_day: hour, day_of_week: dayOfWeek },
|
context: { tasks, hour_of_day: hour, day_of_week: dayOfWeek },
|
||||||
@@ -225,6 +228,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
|
|||||||
dayOfWeek,
|
dayOfWeek,
|
||||||
requestedPromptVersion,
|
requestedPromptVersion,
|
||||||
profile,
|
profile,
|
||||||
|
req.traceparent,
|
||||||
);
|
);
|
||||||
|
|
||||||
const allCandidates: TipCandidate[] = [...signalCandidates, ...llmResult.candidates];
|
const allCandidates: TipCandidate[] = [...signalCandidates, ...llmResult.candidates];
|
||||||
@@ -239,7 +243,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
|
|||||||
const t0 = Date.now();
|
const t0 = Date.now();
|
||||||
|
|
||||||
// Stage 2: score — egreedy bandit with random fallback
|
// Stage 2: score — egreedy bandit with random fallback
|
||||||
const scored = await remotePolicy(req.userId!, allCandidates, profile);
|
const scored = await remotePolicy(req.userId!, allCandidates, profile, req.traceparent);
|
||||||
const latencyMs = Date.now() - t0;
|
const latencyMs = Date.now() - t0;
|
||||||
const tip = scored
|
const tip = scored
|
||||||
? (allCandidates.find((t) => t.id === scored.tipId) ?? randomPolicy(allCandidates))
|
? (allCandidates.find((t) => t.id === scored.tipId) ?? randomPolicy(allCandidates))
|
||||||
@@ -371,6 +375,8 @@ async function sendRewardWithRetry(
|
|||||||
tipId: string,
|
tipId: string,
|
||||||
reward: number,
|
reward: number,
|
||||||
features: TipCandidate['features'],
|
features: TipCandidate['features'],
|
||||||
|
profile: Profile,
|
||||||
|
traceparent?: string,
|
||||||
): Promise<void> {
|
): Promise<void> {
|
||||||
const body = JSON.stringify({
|
const body = JSON.stringify({
|
||||||
user_id: userId,
|
user_id: userId,
|
||||||
@@ -378,13 +384,14 @@ async function sendRewardWithRetry(
|
|||||||
reward,
|
reward,
|
||||||
features,
|
features,
|
||||||
day_of_week: new Date().getDay(),
|
day_of_week: new Date().getDay(),
|
||||||
|
profile_features: profile,
|
||||||
});
|
});
|
||||||
|
|
||||||
for (let attempt = 1; attempt <= 3; attempt++) {
|
for (let attempt = 1; attempt <= 3; attempt++) {
|
||||||
try {
|
try {
|
||||||
const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy`, {
|
const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy/v2`, {
|
||||||
method: 'POST',
|
method: 'POST',
|
||||||
headers: { 'Content-Type': 'application/json' },
|
headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
|
||||||
body,
|
body,
|
||||||
signal: AbortSignal.timeout(3000),
|
signal: AbortSignal.timeout(3000),
|
||||||
});
|
});
|
||||||
@@ -392,7 +399,7 @@ async function sendRewardWithRetry(
|
|||||||
throw new Error(`HTTP ${res.status}`);
|
throw new Error(`HTTP ${res.status}`);
|
||||||
} catch (err: any) {
|
} catch (err: any) {
|
||||||
if (attempt === 3) {
|
if (attempt === 3) {
|
||||||
console.error(`[reward] failed after 3 attempts for tip ${tipId}: ${err.message}`);
|
logger.error({ tipId, err }, 'reward: failed after 3 attempts');
|
||||||
bus.publish('signals.tip.reward_failed', {
|
bus.publish('signals.tip.reward_failed', {
|
||||||
userId,
|
userId,
|
||||||
tipId,
|
tipId,
|
||||||
@@ -463,7 +470,9 @@ router.post('/tip/:id/feedback', requireAuth, async (req: AuthenticatedRequest,
|
|||||||
});
|
});
|
||||||
|
|
||||||
if (candidate) {
|
if (candidate) {
|
||||||
sendRewardWithRetry(req.userId!, tipId, reward, candidate.features);
|
// Re-fetch profile for the v2 ridge update; TTL cache makes this near-instant.
|
||||||
|
const profile = await getProfile(req.userId!);
|
||||||
|
sendRewardWithRetry(req.userId!, tipId, reward, candidate.features, profile, req.traceparent);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Delegate action to the owning signal source (e.g. mark done in Todoist)
|
// Delegate action to the owning signal source (e.g. mark done in Todoist)
|
||||||
|
|||||||
@@ -8,6 +8,11 @@
|
|||||||
*/
|
*/
|
||||||
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
|
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
|
||||||
|
|
||||||
|
vi.mock('../../logger.js', () => ({
|
||||||
|
logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), fatal: vi.fn() },
|
||||||
|
}));
|
||||||
|
import { logger } from '../../logger.js';
|
||||||
|
|
||||||
// ── mock the drizzle query chain: db.select(...).from(...).where(...) ────────
|
// ── mock the drizzle query chain: db.select(...).from(...).where(...) ────────
|
||||||
let users: { userId: string }[] = [];
|
let users: { userId: string }[] = [];
|
||||||
const whereMock = vi.fn(async () => users);
|
const whereMock = vi.fn(async () => users);
|
||||||
@@ -35,6 +40,7 @@ beforeEach(() => {
|
|||||||
whereMock.mockClear();
|
whereMock.mockClear();
|
||||||
fromMock.mockClear();
|
fromMock.mockClear();
|
||||||
selectMock.mockClear();
|
selectMock.mockClear();
|
||||||
|
vi.clearAllMocks();
|
||||||
vi.useFakeTimers();
|
vi.useFakeTimers();
|
||||||
});
|
});
|
||||||
|
|
||||||
@@ -102,8 +108,6 @@ describe('startTodoistSyncScheduler', () => {
|
|||||||
if (id === 'bad') throw new Error('todoist 401');
|
if (id === 'bad') throw new Error('todoist 401');
|
||||||
return [];
|
return [];
|
||||||
});
|
});
|
||||||
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
|
|
||||||
const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {});
|
|
||||||
|
|
||||||
startTodoistSyncScheduler(60_000);
|
startTodoistSyncScheduler(60_000);
|
||||||
await vi.advanceTimersByTimeAsync(10_001);
|
await vi.advanceTimersByTimeAsync(10_001);
|
||||||
@@ -112,19 +116,27 @@ describe('startTodoistSyncScheduler', () => {
|
|||||||
await Promise.resolve();
|
await Promise.resolve();
|
||||||
|
|
||||||
expect(fetchSignalsMock).toHaveBeenCalledTimes(3);
|
expect(fetchSignalsMock).toHaveBeenCalledTimes(3);
|
||||||
expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('sync error'), expect.anything());
|
expect(logger.error).toHaveBeenCalledWith(
|
||||||
expect(logSpy).toHaveBeenCalledWith(expect.stringContaining('2 ok, 1 failed'));
|
expect.objectContaining({ err: expect.anything() }),
|
||||||
|
'scheduler: sync error',
|
||||||
|
);
|
||||||
|
expect(logger.info).toHaveBeenCalledWith(
|
||||||
|
expect.objectContaining({ ok: 2, failed: 1 }),
|
||||||
|
'scheduler: todoist sync',
|
||||||
|
);
|
||||||
});
|
});
|
||||||
|
|
||||||
it('survives a db query failure — logs and skips the tick', async () => {
|
it('survives a db query failure — logs and skips the tick', async () => {
|
||||||
const { startTodoistSyncScheduler } = await import('../scheduler.js');
|
const { startTodoistSyncScheduler } = await import('../scheduler.js');
|
||||||
whereMock.mockRejectedValueOnce(new Error('sqlite locked'));
|
whereMock.mockRejectedValueOnce(new Error('sqlite locked'));
|
||||||
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
|
|
||||||
|
|
||||||
startTodoistSyncScheduler(60_000);
|
startTodoistSyncScheduler(60_000);
|
||||||
await vi.advanceTimersByTimeAsync(10_001);
|
await vi.advanceTimersByTimeAsync(10_001);
|
||||||
|
|
||||||
expect(fetchSignalsMock).not.toHaveBeenCalled();
|
expect(fetchSignalsMock).not.toHaveBeenCalled();
|
||||||
expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('failed to query users'));
|
expect(logger.error).toHaveBeenCalledWith(
|
||||||
|
expect.objectContaining({ err: expect.anything() }),
|
||||||
|
'scheduler: failed to query users',
|
||||||
|
);
|
||||||
});
|
});
|
||||||
});
|
});
|
||||||
|
|||||||
@@ -1,4 +1,5 @@
|
|||||||
import type { Signal, SignalSource } from '@oo/shared-types';
|
import type { Signal, SignalSource } from '@oo/shared-types';
|
||||||
|
import { logger } from '../logger.js';
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Merges signals from all registered sources for a user.
|
* Merges signals from all registered sources for a user.
|
||||||
@@ -24,7 +25,7 @@ export class SignalAggregator {
|
|||||||
if (r.status === 'fulfilled') {
|
if (r.status === 'fulfilled') {
|
||||||
signals.push(...r.value);
|
signals.push(...r.value);
|
||||||
} else {
|
} else {
|
||||||
console.error(`[aggregator] source '${this.sources[i].id}' failed:`, r.reason);
|
logger.error({ sourceId: this.sources[i]!.id, err: r.reason }, 'aggregator: source failed');
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
return signals;
|
return signals;
|
||||||
|
|||||||
@@ -13,6 +13,7 @@ import { db } from '../db/index.js';
|
|||||||
import { integrationTokens } from '../db/schema.js';
|
import { integrationTokens } from '../db/schema.js';
|
||||||
import { eq } from 'drizzle-orm';
|
import { eq } from 'drizzle-orm';
|
||||||
import { todoistSource } from './todoist.js';
|
import { todoistSource } from './todoist.js';
|
||||||
|
import { logger } from '../logger.js';
|
||||||
|
|
||||||
const DEFAULT_INTERVAL_MS = 15 * 60 * 1000;
|
const DEFAULT_INTERVAL_MS = 15 * 60 * 1000;
|
||||||
|
|
||||||
@@ -25,7 +26,7 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
|
|||||||
.from(integrationTokens)
|
.from(integrationTokens)
|
||||||
.where(eq(integrationTokens.tokenStatus, 'active'));
|
.where(eq(integrationTokens.tokenStatus, 'active'));
|
||||||
} catch (err: any) {
|
} catch (err: any) {
|
||||||
console.error(`[scheduler] failed to query users: ${err.message}`);
|
logger.error({ err }, 'scheduler: failed to query users');
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -39,10 +40,10 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
|
|||||||
let failed = 0;
|
let failed = 0;
|
||||||
for (const r of results) {
|
for (const r of results) {
|
||||||
if (r.status === 'fulfilled') ok++;
|
if (r.status === 'fulfilled') ok++;
|
||||||
else { failed++; console.error(`[scheduler] sync error:`, r.reason); }
|
else { failed++; logger.error({ err: r.reason }, 'scheduler: sync error'); }
|
||||||
}
|
}
|
||||||
|
|
||||||
console.log(`[scheduler] todoist sync: ${ok} ok, ${failed} failed (${users.length} users)`);
|
logger.info({ ok, failed, total: users.length }, 'scheduler: todoist sync');
|
||||||
}
|
}
|
||||||
|
|
||||||
// Run once shortly after startup, then on interval
|
// Run once shortly after startup, then on interval
|
||||||
|
|||||||
@@ -3,6 +3,7 @@ import { db } from '../db/index.js';
|
|||||||
import { integrationTokens } from '../db/schema.js';
|
import { integrationTokens } from '../db/schema.js';
|
||||||
import { eq, and } from 'drizzle-orm';
|
import { eq, and } from 'drizzle-orm';
|
||||||
import { bus } from '../events/bus.js';
|
import { bus } from '../events/bus.js';
|
||||||
|
import { logger } from '../logger.js';
|
||||||
|
|
||||||
const CACHE_TTL_MS = 30_000;
|
const CACHE_TTL_MS = 30_000;
|
||||||
|
|
||||||
@@ -46,7 +47,7 @@ export class TodoistSignalSource implements SignalSource {
|
|||||||
|
|
||||||
if (!res.ok) {
|
if (!res.ok) {
|
||||||
if (res.status === 401) {
|
if (res.status === 401) {
|
||||||
console.error(`[todoist] token expired for user ${userId}`);
|
logger.warn({ userId }, 'todoist: token expired');
|
||||||
bus.publish('signals.integration.token_expired', {
|
bus.publish('signals.integration.token_expired', {
|
||||||
userId,
|
userId,
|
||||||
provider: 'todoist',
|
provider: 'todoist',
|
||||||
@@ -88,7 +89,7 @@ export class TodoistSignalSource implements SignalSource {
|
|||||||
});
|
});
|
||||||
|
|
||||||
this.cache.set(userId, { signals, fetchedAt: Date.now() });
|
this.cache.set(userId, { signals, fetchedAt: Date.now() });
|
||||||
bus.publish('signals.task.synced', { userId, count: signals.length, syncedAt: now });
|
bus.publish('signals.task.synced', { userId, source: 'todoist', count: signals.length, syncedAt: now });
|
||||||
|
|
||||||
return signals;
|
return signals;
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -2,30 +2,49 @@
|
|||||||
|
|
||||||
Third-party connectors and the token vault.
|
Third-party connectors and the token vault.
|
||||||
|
|
||||||
## Connector interface
|
## Signal source interface
|
||||||
|
|
||||||
|
Each connector implements `SignalSource` from `@oo/shared-types`:
|
||||||
|
|
||||||
```ts
|
```ts
|
||||||
interface Connector {
|
interface SignalSource {
|
||||||
id: string // e.g. "todoist"
|
readonly id: string // e.g. "todoist"
|
||||||
scopes: string[] // human-readable list shown in consent UI
|
fetchSignals(userId: string): Promise<Signal[]> // returns normalized Signal[]
|
||||||
beginOAuth(user): Promise<{ redirectUrl, state }>
|
act?(userId: string, signalId: string, action: string): Promise<void> // optional write-back
|
||||||
finishOAuth(code, state): Promise<StoredCredential>
|
|
||||||
fetchSignals(user, since?): AsyncIterable<NormalizedEvent>
|
|
||||||
// incremental-sync cursor (Todoist sync_token, webhook timestamps, etc.)
|
|
||||||
// stored in Credential.meta; the connector owns its shape.
|
|
||||||
act?(user, action): Promise<void> // optional write-back (complete task, etc.)
|
|
||||||
revoke(user): Promise<void> // REQUIRED: provider-side token revocation on disconnect
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
`SignalAggregator` (`services/api/src/signals/aggregator.ts`) fans out to all registered sources in parallel, isolating per-source failures.
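
The parallel fan-out with per-source failure isolation can be sketched roughly as follows. This is a minimal illustration, not the code in `aggregator.ts`: the `Signal` and `SignalSource` shapes are re-declared locally in reduced form, and `fetchAllSignals` and its `onError` callback are hypothetical names introduced here.

```typescript
// Sketch only: reduced local copies of the types; the real ones live in @oo/shared-types.
interface Signal { id: string; source: string }
interface SignalSource {
  readonly id: string;
  fetchSignals(userId: string): Promise<Signal[]>;
}

// Hypothetical helper: query every source in parallel. A source that rejects is
// reported through onError and skipped, so it cannot fail the whole request.
async function fetchAllSignals(
  sources: SignalSource[],
  userId: string,
  onError: (sourceId: string, reason: unknown) => void,
): Promise<Signal[]> {
  const results = await Promise.allSettled(sources.map((s) => s.fetchSignals(userId)));
  const signals: Signal[] = [];
  results.forEach((r, i) => {
    if (r.status === 'fulfilled') {
      signals.push(...r.value);
    } else {
      onError(sources[i]?.id ?? 'unknown', r.reason);
    }
  });
  return signals;
}
```

`Promise.allSettled` never rejects, which is what lets one failing connector degrade the result instead of aborting it.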
|
||||||
|
|
||||||
## Token vault
|
## Token vault
|
||||||
|
|
||||||
- Credentials encrypted at rest (libsodium sealed box); key from env/KMS.
|
OAuth tokens are stored in the `integration_tokens` SQLite table (`services/api/src/db/schema.ts`):
|
||||||
- Refresh handled transparently; consumers never see raw tokens.
|
|
||||||
- One row per `(user, provider)` with provider-specific `meta`.
|
|
||||||
|
|
||||||
## Roadmap
|
| Column | Description |
|
||||||
|
|--------|-------------|
|
||||||
|
| `userId` | owner |
|
||||||
|
| `provider` | e.g. `todoist` |
|
||||||
|
| `accessToken` | OAuth access token (plain in dev; encrypted in prod via server secret store) |
|
||||||
|
| `tokenStatus` | `active` \| `needs_reconnect` |
|
||||||
|
|
||||||
- Phase 0: **Todoist** (OAuth2, read tasks, complete task).
|
On a 401 from the upstream API, the connector marks the token `needs_reconnect` and publishes `signals.integration.token_expired` so the client can prompt re-auth.
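
A rough sketch of that 401 path, under stated assumptions: `TokenRow` is a reduced in-memory stand-in for an `integration_tokens` row, `Publish` stands in for `bus.publish`, and `handleUpstreamStatus` is a hypothetical name. The event payload shows only the `userId` and `provider` fields; the real payload may carry more.

```typescript
type TokenStatus = 'active' | 'needs_reconnect';

// Assumption: reduced stand-in for a row of the integration_tokens table.
interface TokenRow { userId: string; provider: string; tokenStatus: TokenStatus }

// Assumption: stand-in for the event bus publish function.
type Publish = (topic: string, payload: Record<string, unknown>) => void;

// Hypothetical helper: on a 401, flag the token and notify listeners.
// Returns true when the token was flagged so the caller can stop the fetch.
function handleUpstreamStatus(status: number, row: TokenRow, publish: Publish): boolean {
  if (status !== 401) return false;
  row.tokenStatus = 'needs_reconnect';
  publish('signals.integration.token_expired', { userId: row.userId, provider: row.provider });
  return true;
}
```

Any other status leaves the row untouched, so transient upstream errors do not force the user to reconnect.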
|
||||||
- Phase 2: Google Calendar, Apple Health (web import), generic webhook ingress.
|
|
||||||
- Phase 5: public SDK so third parties can ship connectors.
|
## Routes
|
||||||
|
|
||||||
|
| Method | Path | Description |
|
||||||
|
|--------|------|-------------|
|
||||||
|
| `GET` | `/api/integrations` | List connected integrations for current user |
|
||||||
|
| `GET` | `/api/integrations/todoist/connect` | Start Todoist OAuth flow |
|
||||||
|
| `GET` | `/api/integrations/todoist/callback` | OAuth callback — exchange code, store token |
|
||||||
|
| `DELETE` | `/api/integrations/:provider` | Disconnect + delete token |
|
||||||
|
|
||||||
|
## Connectors
|
||||||
|
|
||||||
|
| Connector | Status | Signals produced |
|
||||||
|
|-----------|--------|-----------------|
|
||||||
|
| Todoist | Phase 1 — active | `task` signals (today + overdue); `done` write-back |
|
||||||
|
| Google Calendar | Phase 2 — planned | `event` signals |
|
||||||
|
|
||||||
|
## Extraction criteria
|
||||||
|
|
||||||
|
Extract to its own process when credential blast-radius isolation requires it (e.g. token vault with KMS-backed encryption needs to run in a hardened sidecar) or when connector volume justifies separate scaling.
|
||||||
|
|||||||
@@ -1,29 +1,42 @@
|
|||||||
# recommender
|
# recommender
|
||||||
|
|
||||||
The core of oO. Takes a user + a context, returns **one** tip.
|
The core of oO. Takes a user + context, returns **one** tip.
|
||||||
|
|
||||||
## Contract
|
## Contract
|
||||||
|
|
||||||
```
|
```
|
||||||
POST /recommend
|
POST /api/recommend
|
||||||
{ user_id, context?: { time, timezone, client, ... } }
|
{ } (user inferred from session)
|
||||||
→ { tip: { id, kind: "todo"|"advice", title, body, source, deep_link, meta } }
|
→ { tip: { id, content, source, kind, sourceId?, rationale?, createdAt } }
|
||||||
|
|
||||||
POST /feedback
|
POST /api/tip/:id/feedback
|
||||||
{ user_id, tip_id, reaction: "done"|"snooze"|"dismiss", at }
|
{ action: "done"|"dismiss"|"snooze"|"helpful"|"not_helpful", dwellMs? }
|
||||||
|
→ { ok: true }
|
||||||
```
|
```
|
||||||
|
|
||||||
## Internals (stable seams)
|
## Pipeline
|
||||||
|
|
||||||
- **Candidate sources** — pluggable async generators. v0: Todoist tasks via `integrations`. Later: advice library, calendar nudges, health prompts.
|
1. **Signals** — `SignalAggregator.fetchAll(userId)` fans out to all registered `SignalSource` implementations in parallel. Currently: `TodoistSignalSource`. Add a source via `aggregator.register(new MySource())`.
|
||||||
- **Feature assembler** — fills the `context` blob (inline in Phase 0; calls feature store from M1). Never inlined into policy code.
|
2. **LLM candidates** — `POST /generate` on `ml/serving` returns `TipCandidate[]` from the `tip-generator` LiteLLM alias.
|
||||||
- **Policy registry** — `Policy.pick(candidates, context) → tip`. Named entries:
|
3. **Scoring** — all candidates are sent to the active policy on `ml/serving` (`POST /score/egreedy/v2`). Falls back to random if `ml/serving` is unreachable.
|
||||||
- `random` — v0 (Phase 0).
|
4. **Shadow policies** — the active policy runs shadow policies in the same request for offline comparison (ADR-0002). Currently: `egreedy-v2` is active and `egreedy-v1` is the shadow entry (see Policy registry).
|
||||||
- `bandit.linucb.pooled` — v1 (Phase 1). **Global-then-personalize**: pooled features shared across users; per-user residual once data allows.
|
5. **Persistence** — `tipViews` + `tipScores` rows written on every serve; `tipFeedback` row on reaction.
|
||||||
- `remote` — delegates to `ml/serving` FastAPI scorer (Phase 1+).
|
6. **Reward delivery** — a reaction triggers `POST /reward/egreedy/v2` on `ml/serving` with the inferred reward value.
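
The scoring step and its random fallback can be sketched as below. This is an illustration under assumptions, not the route handler itself: `TipCandidate` is reduced to two fields, `pickTip` is a hypothetical name, the injectable `rng` parameter exists only to make the sketch testable, and labelling a fallback pick as `'random'` is an assumption about how the policy name is recorded.

```typescript
// Assumption: reduced candidate shape for illustration.
interface TipCandidate { id: string; content: string }
interface Scored { tipId: string; score: number; policy: string }

// Uniform random pick; rng is injectable so the choice can be made deterministic.
function randomPolicy(candidates: TipCandidate[], rng: () => number = Math.random): TipCandidate {
  if (candidates.length === 0) throw new Error('no candidates');
  const idx = Math.min(candidates.length - 1, Math.floor(rng() * candidates.length));
  return candidates[idx] as TipCandidate;
}

// Hypothetical helper: use the scorer's pick when it names a known candidate,
// otherwise (scorer unreachable or unknown id) fall back to a random candidate.
function pickTip(
  candidates: TipCandidate[],
  scored: Scored | null,
  rng: () => number = Math.random,
): { tip: TipCandidate; policy: string } {
  const chosen = scored ? candidates.find((c) => c.id === scored.tipId) : undefined;
  if (scored && chosen) return { tip: chosen, policy: scored.policy };
  return { tip: randomPolicy(candidates, rng), policy: 'random' };
}
```

Passing `null` for `scored` models the case where the call to `ml/serving` timed out or returned a non-2xx response.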
|
||||||
- **Shadow hook** — every request optionally runs N shadow policies in parallel and logs their picks + estimated rewards. Promotion from shadow → A/B → launch is a separate, deliberate step (ADR-0002).
|
|
||||||
- **TipInstance persistence** — every decision writes `context_snapshot` (features seen at decision time). This is what makes offline replay honest.
|
|
||||||
|
|
||||||
## Phase 0 goal
|
## Signal normalization
|
||||||
|
|
||||||
`RandomPolicy` only. The service, contract, registry, shadow hook, and tip-instance persistence all exist; no ML yet.
|
Signals carry `features: Record<string, number | boolean>` (bandit-ready) and `metadata: Record<string, unknown>` (source-specific raw fields). The bandit treats features as an opaque dict — sources own their feature names. See ADR-0009.
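
As a small illustration of that split, the sketch below declares a reduced `Signal` shape and a hypothetical `numericFeatures` helper that coerces boolean features to 0/1 before they reach a scorer. The helper and the feature names in the usage example (`is_overdue`, `priority`) are assumptions for illustration; ADR-0009 remains the authority on the actual normalization.

```typescript
// Assumption: reduced Signal shape; only the two fields discussed above matter here.
interface Signal {
  id: string;
  source: string;
  features: Record<string, number | boolean>; // bandit-ready values
  metadata: Record<string, unknown>;          // source-specific raw fields
}

// Hypothetical helper: map every feature to a number without interpreting its name,
// which keeps the feature dict opaque to the consumer.
function numericFeatures(features: Record<string, number | boolean>): Record<string, number> {
  const out: Record<string, number> = {};
  for (const [name, value] of Object.entries(features)) {
    out[name] = typeof value === 'boolean' ? (value ? 1 : 0) : value;
  }
  return out;
}

// Usage with made-up feature names:
const example: Signal = {
  id: 'task-1',
  source: 'todoist',
  features: { is_overdue: true, priority: 3 },
  metadata: { rawTitle: 'Pay rent' },
};
const vector = numericFeatures(example.features);
```

Nothing in `metadata` is touched by the helper, matching the intent that raw source fields stay out of the scoring path.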
|
||||||
|
|
||||||
|
## Policy registry
|
||||||
|
|
||||||
|
| Policy | Status | Notes |
|
||||||
|
|--------|--------|-------|
|
||||||
|
| `random` | Fallback | Used when ml/serving is unreachable |
|
||||||
|
| `egreedy-v1` | Shadow | d=7, ADR-0007 |
|
||||||
|
| `egreedy-v2` | **Active** | d=12 + profile features, ADR-0012 |
|
||||||
|
|
||||||
|
Shadow → active promotion requires offline sim + online agreement (ADR-0002).
|
||||||
|
|
||||||
|
## Extraction criteria
|
||||||
|
|
||||||
|
Extract to its own process once it becomes a scaling hotspot: when `POST /recommend` p99 latency exceeds the SLA, or when recommendation CPU displaces API serving on a shared host.
|
||||||
|
|||||||