Compare commits


14 Commits

Author SHA1 Message Date
e40dfdcbb0 chore(infra): wire MLflow/Airflow env vars, fix healthcheck, add .dockerignore
Some checks failed
buf-check / Lint & breaking-change check (push) Has been cancelled
- docker-compose: pass ML_SERVING_URL, MLFLOW_URL, AIRFLOW_URL + creds to api service
- docker-compose: pass NEXT_PUBLIC_MLFLOW_URL/AIRFLOW_URL to admin service
- docker-compose: replace wget healthcheck with node fetch (wget not in node image)
- docker-compose: enable Airflow basic_auth API backend; add MLflow pip dep for DAGs
- Dockerfiles: tighten layer caching, add .dockerignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 12:08:43 +00:00
bad1bb2cba feat(simulate): MLflow tracking, Airflow DAG integration, health checks for mlflow/airflow
- sim_runs schema: add judge_mode, n_policies, airflow_dag_run_id, mlflow_run_id columns
- admin health endpoint: add mlflow + airflow checks (Basic auth for Airflow API)
- admin nav: add Simulations page link; rename section label
- runner.py: optional MLflow experiment tracking; multi-policy support
- sim_dag.py: Airflow DAG for offline sim pipeline
- admin simulate page + API client methods for sim runs
- shared-types tsconfig: exclude test files from build

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 12:08:36 +00:00
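The Airflow health check in this commit authenticates with HTTP Basic auth. A minimal TypeScript sketch of that idea follows. `basicAuthHeader` and `statusFromResponse` are hypothetical helper names, and the mapping from an HTTP result to `ok` / `degraded` / `down` is an assumption for illustration, not the repository's actual logic.

```typescript
type ServiceStatus = "ok" | "degraded" | "down";

// Build an RFC 7617 Basic credential: base64("user:password").
export function basicAuthHeader(user: string, password: string): string {
  const encoded = Buffer.from(`${user}:${password}`, "utf8").toString("base64");
  return `Basic ${encoded}`;
}

// Assumed mapping: no response at all means "down", 2xx means "ok",
// anything else (auth failure, 5xx, ...) is reported as "degraded".
export function statusFromResponse(httpStatus: number | null): ServiceStatus {
  if (httpStatus === null) return "down";
  if (httpStatus >= 200 && httpStatus < 300) return "ok";
  return "degraded";
}
```

A caller would pass the header as `Authorization` when probing the Airflow API and feed the resulting status code (or `null` on a network error) into `statusFromResponse`.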
e96ceb7ee1 feat(auth): token-based admin authentication for Playwright/CI (#105)
Add POST /api/auth/token — validates ADMIN_TOKEN env var, creates a 24h
session and sets the sid cookie so automated tools can access the admin
panel without Google OAuth. Admin login page gains a token input form.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 12:07:43 +00:00
b554970032 docs(observability): add services/api README; update ml/serving + recommender docs (#18)
- services/api/README.md: new — contract, middleware stack, background
  tasks, config table (LOG_LEVEL, SENTRY_DSN), health story, extraction
  criteria
- ml/serving/README.md: add Observability section (structlog JSON,
  traceparent → trace_id binding), add SENTRY_DSN + ENV to config table
- services/recommender/README.md: fix policy table — egreedy-v2 is
  active (#99), egreedy-v1 is shadow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 03:41:39 +00:00
c4960d0601 feat(observability): structured logs, W3C trace IDs, Sentry hooks (#18)
- TS: pino + pino-http; every HTTP request log includes traceId from
  W3C traceparent header (generated if absent); forwarded to ml/serving
  on all /score, /generate, /reward, and /api/ml proxy calls
- Python: structlog JSON; FastAPI middleware binds trace_id via
  contextvars so every log line within a request carries it
- Sentry: optional SENTRY_DSN init in both runtimes (no-op if unset)
- Replace all console.* calls across services/api with pino logger
- Update tests to spy on logger instead of console

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 03:37:28 +00:00
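The commit above takes the trace ID from the W3C `traceparent` header and generates one when the header is absent. The sketch below shows one way to do that in TypeScript. The function names are hypothetical, the `01` (sampled) flag is an assumption, and the regex only accepts the four-field layout defined for version `00`.

```typescript
import { randomBytes } from "node:crypto";

// version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Return the trace ID carried by the header, or a fresh random one.
export function traceIdFrom(header: string | undefined): string {
  const m = header ? TRACEPARENT_RE.exec(header.trim()) : null;
  // Version "ff" and an all-zero trace-id are invalid per the spec.
  if (m && m[1] !== "ff" && m[2] !== "0".repeat(32)) return m[2];
  return randomBytes(16).toString("hex");
}

// Build a header to forward downstream, with a new span (parent) ID.
export function buildTraceparent(traceId: string): string {
  const spanId = randomBytes(8).toString("hex");
  return `00-${traceId}-${spanId}-01`;
}
```

Keeping the incoming trace ID and minting only a new span ID is what lets log lines in the TS API and in `ml/serving` be joined on the same `traceId` / `trace_id` value.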
7281af83a4 feat(bandit): promote egreedy-v2 (D=12, profile features) as active policy (#99)
Offline sim gate passed — egreedy-v2 mean reward −0.629 vs egreedy-v1 −0.642
(5 users × 20 rounds, rule judge, seed 42). v2 wins 3/5 personas.

- recommender.ts: switch remotePolicy() to /score/egreedy/v2
- recommender.ts: switch sendRewardWithRetry() to /reward/egreedy/v2 with
  profile_features payload so the ridge update uses the full D=12 vector
- recommender.ts: re-fetch profile at feedback time (TTL-cached, near-instant)
- ADR-0012: status Accepted → Promoted, promotion record appended

Shadow entry egreedy-v2-shadow kept in registry (active: false) for rollback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 03:08:28 +00:00
cba3f1a184 docs(services): update integrations + recommender READMEs for signal abstraction (#78)
integrations/README — replace stale Connector interface and fictional
libsodium vault with the actual SignalSource pattern, SQLite token table,
and real OAuth routes.

recommender/README — document the SignalAggregator pipeline, current
policy registry, and actual /recommend + /feedback contract shapes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 17:17:38 +00:00
352469162d fix(signals): add missing source field to TaskSyncedEvent (#78)
TaskSyncedPayload in shared-types and ml/serving schemas both require
source, but TaskSyncedEvent in bus.ts and the todoist publish call both
omitted it — causing the JetStream consumer to nak every task.synced
message on validation failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 17:15:32 +00:00
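The failure mode described above (a missing required field makes the consumer nak every message) can be illustrated with a small TypeScript sketch. The real consumer is Python and validates with pydantic; the names below are hypothetical, and `userId` is an invented field used only to make the example concrete.

```typescript
type Decision = "ack" | "nak";

interface TaskSyncedPayload {
  source: string; // required by the shared schema, per the commit message
  userId: string; // hypothetical field, for illustration only
}

// Return the typed payload, or null when a required field is missing.
function parseTaskSynced(raw: unknown): TaskSyncedPayload | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.source !== "string" || r.source.length === 0) return null;
  if (typeof r.userId !== "string") return null;
  return { source: r.source, userId: r.userId };
}

// A validation failure is answered with nak, so the message is redelivered.
function decide(raw: unknown): Decision {
  return parseTaskSynced(raw) === null ? "nak" : "ack";
}
```

With the publisher omitting `source`, `decide` returns `"nak"` for every message, which matches the symptom the fix addresses.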
45416000f9 feat(features): per-feature freshness spec — JIT vs batched (#61)
Each ml/features/*.py now declares freshness, source, and fallback per
feature. ProfileFeature gains ttl_sec (mirrored from registry.ts),
freshness="batched", source, and fallback. context.py adds
ContextFeatureSpec + CONTEXT_FEATURES for the three JIT features
(hour_of_day, day_of_week, tasks). CI test parses ttlSec from registry.ts
to catch drift. ml/README updated with split JIT/batched feature contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 17:02:55 +00:00
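The drift check mentioned above (parse `ttlSec` out of `registry.ts` and compare it with the mirrored `ttl_sec` values) could look roughly like this. It is a sketch only: the layout of `registry.ts` is not shown in this diff, so the `name: '...', ttlSec: N` pattern, the feature names, and both function names are assumptions.

```typescript
// Extract { featureName: ttlSec } pairs from registry source text.
// Assumes each entry is an object literal with `name` before `ttlSec`.
function parseTtlSec(registrySource: string): Record<string, number> {
  const out: Record<string, number> = {};
  const re = /name:\s*['"]([\w.-]+)['"][^}]*?ttlSec:\s*(\d+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(registrySource)) !== null) {
    out[m[1]] = Number(m[2]);
  }
  return out;
}

// List features whose mirrored TTL is missing or differs from the registry.
function findDrift(
  registry: Record<string, number>,
  mirrored: Record<string, number>,
): string[] {
  return Object.keys(registry).filter((name) => mirrored[name] !== registry[name]);
}
```

A CI test would fail when `findDrift` returns a non-empty list, which is the behaviour the commit message describes.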
bd3ea1b8b1 docs(schema): update docs for #54 — proto registry + buf CI gate
- packages/shared-types/README.md: new — documents HTTP vs event surfaces,
  proto file layout, schema evolution rules, and how to run buf locally
- ml/serving/README.md: note pydantic payload validation in consumer section
- CLAUDE.md: replace "schema registry enforced when #54 lands" with
  the actual state; remove #54 from active-work list

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:53:20 +00:00
377373a95d test(schema): unit tests for schemas.py and nats_consumer._handle (#54)
17 tests covering: pydantic model validation (all payload types, optional
fields, invalid enum values, missing required fields), _handle write path
for task_synced, validation errors surfaced through _make_handler causing
nak instead of ack.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:51:15 +00:00
d539fde0c1 feat(schema): protobuf event registry + buf CI gate (#54)
- Add proto schemas in packages/shared-types/events/ (oo.events.v1):
  envelope.proto, signals.proto, integration.proto
- buf.yaml with STANDARD lint + FILE breaking-change rules
- .gitea/workflows/buf-check.yaml: lint + breaking check on every PR
  touching events/ (needs a Gitea Actions runner to execute)
- scripts/buf-check.sh: local equivalent of the CI check
- NormalizedEvent TS envelope gains eventId, schemaVersion, producer
  to align with the proto Envelope message
- ml/serving/schemas.py: pydantic models mirroring the v1 proto types
- nats_consumer.py: validate payloads via pydantic instead of raw .get()

A field-rename PR will now fail buf breaking with exit code 100 and
show the offending messages. To make a breaking change: keep the old
field reserved, add the new one, bump schema_version to v2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:48:24 +00:00
f48b5a7646 docs(ml): serving README + update ml/README and CLAUDE.md for #98
- ml/serving/README.md: new — contract, JetStream consumer docs, config,
  health story, extraction criteria, state file reference
- ml/README.md: note JetStream consumers in serving/ row
- CLAUDE.md: update active work to reflect #98 shipped, #99 still pending

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:21:40 +00:00
4652e4b582 feat(ml): JetStream durable consumers in ml/serving (#98)
Adds a NATS JetStream consumer to ml/serving so the feature pipeline
can react to events without the API triggering every read.

- nats_consumer.py: durable push consumers for signals.> and feedback.>
  streams; acks on success, naks for redeliver, up to NATS_MAX_DELIVER
  attempts; per-consumer health state (last_msg_ts, processed, errors)
- main.py: FastAPI lifespan wires start/stop; /health exposes nats state
- requirements.txt: adds nats-py>=2.9.0
- Dockerfile.ml: copy all *.py from ml/serving (was missing prompts.py)

Handled subjects:
  signals.task.synced   → writes per-user sync metadata to STATE_DIR
  signals.tip.feedback  → logged for observability (reward via HTTP path)

Config: NATS_URL (empty = disabled), NATS_DURABLE_PREFIX, NATS_MAX_DELIVER

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:19:47 +00:00
62 changed files with 3198 additions and 278 deletions

.dockerignore Normal file

@@ -0,0 +1,19 @@
**/node_modules
**/.next
**/dist
**/coverage
**/.vitest-cache
**/.turbo
.git
.gitea
.github
.vscode
.idea
**/.env
**/.env.local
**/*.log
docs
infra/docker/data
**/__tests__
**/*.test.ts
**/*.test.tsx


@@ -10,6 +10,32 @@ API_BASE_URL=http://localhost:3078
WEB_BASE_URL=http://localhost:3000
ML_SERVING_URL=http://localhost:8000
# MLflow (mlops profile) — http://localhost:5000/mlflow in dev, https://o.alogins.net/mlflow in prod.
# MLFLOW_ADMIN_PASSWORD seeds the admin account on first boot (changing it after first run
# requires the MLflow UI or API — see infra/mlflow/basic_auth.ini).
MLFLOW_URL=http://localhost:5000
MLFLOW_ADMIN_PASSWORD=change-me
# Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
NEXT_PUBLIC_MLFLOW_URL=http://localhost:5000
# Airflow (mlops profile) — http://localhost:8080/airflow in dev.
# Start with: docker compose --profile full --profile mlops up
AIRFLOW_URL=http://localhost:8080
AIRFLOW_ADMIN_PASSWORD=change-me
AIRFLOW_DB_PASSWORD=airflow
AIRFLOW_SECRET_KEY=change-me-in-prod
AIRFLOW_FERNET_KEY=
AIRFLOW_BASE_URL=https://o.alogins.net/airflow
# Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
NEXT_PUBLIC_AIRFLOW_URL=http://localhost:8080
# Shared secret for Airflow→API internal callbacks. Generate: openssl rand -hex 32
INTERNAL_API_TOKEN=
# Static token for automated/service access to the admin panel (e.g. Playwright tests).
# Leave empty to disable token-based login. Generate: openssl rand -hex 32
ADMIN_TOKEN=
# AI stack — shared Agap services (ollama + litellm + langfuse). Not run from oO.
# Prod: https://llm.alogins.net | Dev: http://host.docker.internal:4000 from containers,
# http://localhost:4000 from host. Ollama: http://host.docker.internal:11434 / :11434.


@@ -0,0 +1,37 @@
name: buf-check

on:
  push:
    branches: [main]
    paths:
      - 'packages/shared-types/events/**'
  pull_request:
    paths:
      - 'packages/shared-types/events/**'

jobs:
  buf:
    name: Lint & breaking-change check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install buf
        run: |
          BUF_VERSION=1.50.0
          curl -sSfL \
            "https://github.com/bufbuild/buf/releases/download/v${BUF_VERSION}/buf-Linux-x86_64" \
            -o /usr/local/bin/buf
          chmod +x /usr/local/bin/buf
          buf --version

      - name: buf lint
        run: buf lint packages/shared-types/events

      - name: buf breaking
        if: github.event_name == 'pull_request'
        run: |
          buf breaking packages/shared-types/events \
            --against ".git#branch=${{ github.base_ref }},subdir=packages/shared-types/events"


@@ -56,7 +56,7 @@ docs/ architecture notes, ADRs, API specs
## Contracts between modules
- **HTTP** (OpenAPI, in `packages/shared-types/http/`) — synchronous request/response. In-process today; over the network once extracted. Signatures are identical.
-- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Schema registry enforced in CI when #54 lands; until then payloads are JSON envelopes (ADR-0005).
+- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Proto schemas (ADR-0005) live in `packages/shared-types/events/oo/events/v1/`; `buf lint` + `buf breaking` run in CI on every PR touching those files (`.gitea/workflows/buf-check.yaml`).
- Do not redefine types per module. Regenerate from `shared-types`.
## Conventions
@@ -100,7 +100,7 @@ Ollama and LiteLLM are **shared Agap services**, not oO services — they live i
**M1 shipped. M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`.
-Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
+Active work: bandit promotion (#99 — offline sim + ADR-0012 pending) and M2 issues (#61 freshness SLAs, #78 signal abstraction, #93 model benchmark).
## What NOT to do
@@ -112,3 +112,13 @@ Active work: AI tip generation pipeline — issues #86#93 in M2 milestone.
- Don't call LLMs directly from application code. All LLM calls go through `ml/serving` (Python) via `LITELLM_URL`. The TS recommender never holds a model name.
- Don't embed MLflow/Airflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `/airflow`, `ai.alogins.net`.
- Don't `nats.publish()` directly from feature code. All publishes go through the in-process `Bus` (`services/api/src/events/bus.ts`); the NATS adapter (`events/nats.ts`) bridges every publish to JetStream when `NATS_URL` is set. This keeps subscribers, the ring-buffer tail used by the admin event viewer, and JetStream all in lockstep.
## Admin app
`apps/admin` rewrites `/api/*` → `$NEXT_PUBLIC_API_URL/api/*` via `next.config.ts`. So `apiFetch('/admin/stats')` in `apps/admin/src/lib/api.ts` hits the Express backend, not a Next.js route.
Running `tsc --noEmit -p apps/admin/tsconfig.json` always reports `Cannot find module 'next'` errors — expected outside the Next.js build context; use `next build` for real type errors.
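The `/api/*` rewrite described in this section can be sketched as a plain function that produces the entries a Next.js `rewrites()` config would return. `buildRewrites` is a hypothetical helper name, and `http://localhost:3078` is only an assumed example value for `NEXT_PUBLIC_API_URL`; the actual `next.config.ts` is not part of this diff.

```typescript
interface Rewrite {
  source: string;
  destination: string;
}

// Map every /api/... path on the admin app to the same path on the API host.
// `:path*` is the Next.js wildcard that matches zero or more path segments.
export function buildRewrites(apiUrl: string): Rewrite[] {
  const base = apiUrl.replace(/\/+$/, ""); // drop trailing slashes
  return [{ source: "/api/:path*", destination: `${base}/api/:path*` }];
}

// In next.config.ts this would be wired roughly as:
//   async rewrites() { return buildRewrites(process.env.NEXT_PUBLIC_API_URL ?? ""); }
```

Because the browser only ever talks to the admin origin, the `sid` cookie stays first-party and no CORS configuration is needed for these calls.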
## Auth / session pattern
Sessions use an `sid` cookie. Admin routes stack `requireAuth` (sets `req.userId`) then `requireAdmin` (checks `role = 'admin'` in DB). Token-based admin auth: `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var sets the `sid` cookie — used by Playwright and CI.
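A minimal sketch of the token check behind `POST /api/auth/token`, under stated assumptions: the helper names are hypothetical, and the constant-time comparison is a suggested hardening rather than something this diff confirms. The 24 h session lifetime and the "empty `ADMIN_TOKEN` disables token login" rule come from the commit message and `.env` comments.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

const SESSION_TTL_MS = 24 * 60 * 60 * 1000; // 24 h session

// Compare the submitted token with ADMIN_TOKEN without leaking timing.
// Hashing first gives both buffers the same length, which
// timingSafeEqual requires.
export function tokenMatches(provided: string, expected: string | undefined): boolean {
  if (!expected) return false; // unset or empty ADMIN_TOKEN disables token login
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

// Expiry timestamp to store with the session behind the `sid` cookie.
export function sessionExpiry(now: Date): Date {
  return new Date(now.getTime() + SESSION_TTL_MS);
}
```

A route handler would call `tokenMatches(req.body.token, process.env.ADMIN_TOKEN)`, respond with 401 on `false`, and otherwise create the session and set the `sid` cookie.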


@@ -8,6 +8,15 @@ Next.js 15 app. Deployed at `admin.o.alogins.net` (dev: `http://localhost:3080`)
and checks `role === 'admin'`. First admin is seeded via `ADMIN_SEED_EMAIL` env var at API startup.
- Admin write actions are appended to the `admin_actions` audit log in the DB.
## Authentication
Two ways to sign in:
| Method | How |
|--------|-----|
| Google OAuth | Click "Sign in with Google" on the login page |
| Token | `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var; sets `sid` cookie valid for 24 h. Used by Playwright tests and CI automation. |
## Pages
| Route | Description |


@@ -1,15 +1,67 @@
'use client';
import { useState } from 'react';
import { useRouter } from 'next/navigation';
export default function LoginPage() { export default function LoginPage() {
const router = useRouter();
const [token, setToken] = useState('');
const [error, setError] = useState('');
const [loading, setLoading] = useState(false);
async function handleTokenLogin(e: React.FormEvent) {
e.preventDefault();
setError('');
setLoading(true);
try {
const res = await fetch('/api/auth/token', {
method: 'POST',
credentials: 'include',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ token }),
});
if (!res.ok) {
const data = await res.json().catch(() => ({}));
setError((data as { error?: string }).error ?? 'Invalid token');
return;
}
router.push('/');
} catch {
setError('Request failed');
} finally {
setLoading(false);
}
}
return ( return (
<div className="flex min-h-screen items-center justify-center"> <div className="flex min-h-screen items-center justify-center">
<div className="text-center space-y-4"> <div className="text-center space-y-6 w-72">
<h1 className="text-2xl font-semibold">oO Admin</h1> <h1 className="text-2xl font-semibold">oO Admin</h1>
<p className="text-gray-400 text-sm">Sign in via the main app first, then return here.</p>
<a <a
href="/sign-in" href="/sign-in"
className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors" className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors"
> >
Sign in with Google Sign in with Google
</a> </a>
<form onSubmit={handleTokenLogin} className="space-y-3">
<input
type="password"
placeholder="Admin token"
value={token}
onChange={(e) => setToken(e.target.value)}
className="w-full px-3 py-2 bg-gray-900 border border-gray-700 rounded text-sm focus:outline-none focus:border-gray-500"
/>
{error && <p className="text-red-400 text-xs">{error}</p>}
<button
type="submit"
disabled={loading || !token}
className="w-full px-4 py-2 bg-gray-700 text-white rounded text-sm font-medium hover:bg-gray-600 disabled:opacity-40 transition-colors"
>
{loading ? 'Signing in…' : 'Sign in with token'}
</button>
</form>
</div> </div>
</div> </div>
); );


@@ -0,0 +1,220 @@
'use client';

import { useEffect, useState } from 'react';
import { AdminShell } from '@/components/AdminShell';
import {
  startSimulation,
  getSimulationRuns,
  getSimulationRun,
  SimRun,
} from '@/lib/api';

const POLICIES = ['linucb-v1', 'egreedy-v1', 'egreedy-v2'];

const mlflowBase = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
const airflowBase = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';

function mlflowRunUrl(runId: string) {
  return `${mlflowBase}/#/experiments/1/runs/${runId}`;
}

function airflowRunUrl(dagRunId: string) {
  return `${airflowBase}/dags/bandit_sim/grid?dag_run_id=${encodeURIComponent(dagRunId)}`;
}

function StatusBadge({ status }: { status: string }) {
  const cls: Record<string, string> = {
    running: 'bg-blue-900 text-blue-300 border-blue-800',
    done: 'bg-green-900 text-green-300 border-green-800',
    failed: 'bg-red-900 text-red-300 border-red-800',
    pending: 'bg-gray-800 text-gray-400 border-gray-700',
  };
  return (
    <span className={`text-xs px-2 py-0.5 rounded border ${cls[status] ?? cls.pending}`}>
      {status}
    </span>
  );
}

function SummaryRow({ run }: { run: SimRun }) {
  const summary = run.summaryJson ? JSON.parse(run.summaryJson) as Record<string, { total_reward: number; mean_reward: number; n_pulls: number }> : null;

  return (
    <div className="bg-gray-900 border border-gray-800 rounded p-4 space-y-2">
      <div className="flex items-center justify-between">
        <div className="space-y-0.5">
          <div className="flex items-center gap-2">
            <span className="font-mono text-xs text-gray-500">{run.id}</span>
            <StatusBadge status={run.status} />
            {run.winner && <span className="text-xs text-indigo-400">winner: {run.winner}</span>}
          </div>
          <div className="text-xs text-gray-600">
            {run.nUsers}u × {run.nRounds}r × {run.tasksPerRound}t/r {run.judgeMode} judge
            {' · '}{new Date(run.createdAt).toLocaleString()}
          </div>
        </div>
        <div className="flex items-center gap-2 flex-shrink-0">
          {run.mlflowRunId && (
            <a href={mlflowRunUrl(run.mlflowRunId)} target="_blank" rel="noreferrer"
              className="text-xs text-indigo-400 hover:underline">MLflow </a>
          )}
          {run.airflowDagRunId && (
            <a href={airflowRunUrl(run.airflowDagRunId)} target="_blank" rel="noreferrer"
              className="text-xs text-indigo-400 hover:underline">Airflow </a>
          )}
        </div>
      </div>
      {summary && (
        <div className="grid grid-cols-2 gap-2 pt-1 lg:grid-cols-3">
          {Object.entries(summary).map(([policy, s]) => (
            <div key={policy} className={`rounded border p-2 text-xs ${policy === run.winner ? 'border-indigo-700 bg-indigo-950' : 'border-gray-800'}`}>
              <div className="font-mono font-medium text-gray-300 mb-1">{policy}</div>
              <div className="text-gray-500 space-y-0.5">
                <div>total <span className="text-gray-300">{s.total_reward.toFixed(2)}</span></div>
                <div>mean <span className="text-gray-300">{s.mean_reward.toFixed(4)}</span></div>
                <div>pulls <span className="text-gray-300">{s.n_pulls}</span></div>
              </div>
            </div>
          ))}
        </div>
      )}
    </div>
  );
}

export default function SimulatePage() {
  const [runs, setRuns] = useState<SimRun[]>([]);
  const [loading, setLoading] = useState(true);
  const [launching, setLaunching] = useState(false);
  const [error, setError] = useState('');
  const [msg, setMsg] = useState('');

  const [nUsers, setNUsers] = useState(5);
  const [nRounds, setNRounds] = useState(20);
  const [tasksPerRound, setTasksPerRound] = useState(8);
  const [judgeMode, setJudgeMode] = useState<'rule' | 'llm'>('rule');
  const [selectedPolicies, setSelectedPolicies] = useState<string[]>(['linucb-v1', 'egreedy-v1']);

  const refresh = () =>
    getSimulationRuns()
      .then((r) => setRuns(r.runs))
      .catch((e) => setError(e.message))
      .finally(() => setLoading(false));

  useEffect(() => {
    refresh();
    const t = setInterval(refresh, 8_000);
    return () => clearInterval(t);
  }, []);

  const togglePolicy = (p: string) =>
    setSelectedPolicies((prev) =>
      prev.includes(p) ? prev.filter((x) => x !== p) : [...prev, p],
    );

  const handleLaunch = async () => {
    if (selectedPolicies.length < 2) { setError('Select at least 2 policies.'); return; }
    setLaunching(true); setError(''); setMsg('');
    try {
      const r = await startSimulation({ nUsers, nRounds, tasksPerRound, judgeMode, policies: selectedPolicies });
      setMsg(r.airflow_dag_run_id
        ? `Launched via Airflow — dag_run_id: ${r.airflow_dag_run_id}`
        : `Launched locally — run id: ${r.id}`);
      await refresh();
    } catch (e: unknown) {
      setError((e as Error).message);
    } finally {
      setLaunching(false);
    }
  };

  return (
    <AdminShell>
      <div className="space-y-8 max-w-4xl">
        <h1 className="text-xl font-semibold">Simulations</h1>
        {error && <p className="text-red-400 text-sm">{error}</p>}
        {msg && <p className="text-green-400 text-sm">{msg}</p>}

        {/* Launch form */}
        <section className="bg-gray-900 border border-gray-800 rounded p-5 space-y-4">
          <h2 className="text-base font-medium text-gray-300">New simulation</h2>
          <div className="grid grid-cols-3 gap-4 text-sm">
            <label className="space-y-1">
              <span className="text-gray-500">Users</span>
              <input type="number" min={1} max={50} value={nUsers}
                onChange={(e) => setNUsers(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
            <label className="space-y-1">
              <span className="text-gray-500">Rounds</span>
              <input type="number" min={1} max={200} value={nRounds}
                onChange={(e) => setNRounds(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
            <label className="space-y-1">
              <span className="text-gray-500">Tasks/round</span>
              <input type="number" min={1} max={20} value={tasksPerRound}
                onChange={(e) => setTasksPerRound(Number(e.target.value))}
                className="w-full bg-gray-950 border border-gray-700 rounded px-2 py-1 text-gray-300" />
            </label>
          </div>

          <div className="space-y-1 text-sm">
            <span className="text-gray-500">Policies (select 2)</span>
            <div className="flex gap-2 flex-wrap pt-1">
              {POLICIES.map((p) => (
                <button key={p} onClick={() => togglePolicy(p)}
                  className={`px-3 py-1 rounded border text-xs font-mono ${
                    selectedPolicies.includes(p)
                      ? 'bg-indigo-900 border-indigo-700 text-indigo-200'
                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
                  }`}>
                  {p}
                </button>
              ))}
            </div>
          </div>

          <div className="space-y-1 text-sm">
            <span className="text-gray-500">Judge</span>
            <div className="flex gap-2 pt-1">
              {(['rule', 'llm'] as const).map((m) => (
                <button key={m} onClick={() => setJudgeMode(m)}
                  className={`px-3 py-1 rounded border text-xs ${
                    judgeMode === m
                      ? 'bg-gray-700 border-gray-500 text-white'
                      : 'border-gray-700 text-gray-500 hover:border-gray-500'
                  }`}>
                  {m}
                </button>
              ))}
            </div>
            {judgeMode === 'llm' && (
              <p className="text-xs text-yellow-600 mt-1">LLM judge requires ANTHROPIC_API_KEY in ml/serving env.</p>
            )}
          </div>

          <button onClick={handleLaunch} disabled={launching}
            className="bg-indigo-600 hover:bg-indigo-500 disabled:opacity-50 text-white rounded px-4 py-2 text-sm">
            {launching ? 'Launching…' : 'Launch simulation'}
          </button>

          <p className="text-xs text-gray-600">
            Runs via <a href={airflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">Airflow</a> (mlops profile) when available; falls back to local subprocess.
            Results logged to <a href={mlflowBase} target="_blank" rel="noreferrer" className="text-indigo-500 hover:underline">MLflow</a>.
          </p>
        </section>

        {/* Run history */}
        <section className="space-y-3">
          <h2 className="text-base font-medium text-gray-300">
            Run history
            {loading && <span className="text-xs text-gray-600 ml-2">loading</span>}
          </h2>
          {runs.length === 0 && !loading && (
            <p className="text-gray-600 text-sm">No simulations yet.</p>
          )}
          {runs.map((r) => <SummaryRow key={r.id} run={r} />)}
        </section>
      </div>
    </AdminShell>
  );
}


@@ -2,6 +2,7 @@
import Link from 'next/link'; import Link from 'next/link';
import { usePathname } from 'next/navigation'; import { usePathname } from 'next/navigation';
import { useEffect, useState } from 'react';
const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow'; const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow'; const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
@@ -10,6 +11,7 @@ type NavItem = {
href: string; href: string;
label: string; label: string;
external?: boolean; external?: boolean;
svcName?: string; // key in the health services map
}; };
type NavSection = { type NavSection = {
@@ -31,10 +33,11 @@ const NAV: NavSection[] = [
], ],
}, },
{ {
label: 'Recommender status', label: 'Recommender',
items: [ items: [
{ href: '/tips', label: 'Tips' }, { href: '/tips', label: 'Tips' },
{ href: '/reward-analytics', label: 'Rewards' }, { href: '/reward-analytics', label: 'Rewards' },
{ href: '/simulate', label: 'Simulations' },
], ],
}, },
{ {
@@ -50,14 +53,33 @@ const NAV: NavSection[] = [
label: 'Resources', label: 'Resources',
items: [ items: [
{ href: '/docs', label: 'Docs' }, { href: '/docs', label: 'Docs' },
{ href: mlflowUrl, label: 'MLflow ↗', external: true }, { href: mlflowUrl, label: 'MLflow ↗', external: true, svcName: 'mlflow' },
{ href: airflowUrl, label: 'Airflow ↗', external: true }, { href: airflowUrl, label: 'Airflow ↗', external: true, svcName: 'airflow' },
], ],
}, },
]; ];
const STATUS_DOT: Record<string, string> = {
ok: 'bg-green-500',
degraded: 'bg-yellow-400',
down: 'bg-red-500',
};
export function AdminShell({ children }: { children: React.ReactNode }) { export function AdminShell({ children }: { children: React.ReactNode }) {
const pathname = usePathname(); const pathname = usePathname();
const [svcStatus, setSvcStatus] = useState<Record<string, string>>({});
useEffect(() => {
fetch('/api/admin/health', { credentials: 'include' })
.then((r) => r.json())
.then((data: { services?: { name: string; status: string }[] }) => {
const map: Record<string, string> = {};
for (const s of data.services ?? []) map[s.name] = s.status;
setSvcStatus(map);
})
.catch(() => {});
}, []);
return ( return (
<div className="flex min-h-screen"> <div className="flex min-h-screen">
{/* Sidebar */} {/* Sidebar */}
@@ -83,13 +105,19 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
const active = const active =
!item.external && !item.external &&
(item.href === '/' ? pathname === '/' : pathname.startsWith(item.href)); (item.href === '/' ? pathname === '/' : pathname.startsWith(item.href));
const className = `flex items-center px-3 py-2 rounded text-sm transition-colors ${ const className = `flex items-center gap-2 px-3 py-2 rounded text-sm transition-colors ${
active active
? 'bg-gray-800 text-white font-medium' ? 'bg-gray-800 text-white font-medium'
: item.external : item.external
? 'text-gray-500 hover:text-white hover:bg-gray-900' ? 'text-gray-500 hover:text-white hover:bg-gray-900'
: 'text-gray-400 hover:text-white hover:bg-gray-900' : 'text-gray-400 hover:text-white hover:bg-gray-900'
}`; }`;
const dot = item.svcName
? svcStatus[item.svcName]
? <span className={`inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 ${STATUS_DOT[svcStatus[item.svcName]] ?? STATUS_DOT.down}`} />
: <span className="inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 bg-gray-700" />
: null;
return item.external ? ( return item.external ? (
<a <a
key={item.href} key={item.href}
@@ -98,6 +126,7 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
rel="noreferrer" rel="noreferrer"
className={className} className={className}
> >
{dot}
{item.label} {item.label}
</a> </a>
) : ( ) : (


@@ -262,3 +262,49 @@ export function saveQuery(name: string, querySql: string) {
export function deleteSavedQuery(id: string) {
  return apiFetch<{ ok: boolean }>(`/admin/saved-queries/${id}`, { method: 'DELETE' });
}

// ── Simulations ────────────────────────────────────────────────────────────

export interface SimRun {
  id: string;
  policyA: string;
  policyB: string;
  nUsers: number;
  nRounds: number;
  tasksPerRound: number;
  judgeMode: string;
  nPolicies: number;
  status: 'pending' | 'running' | 'done' | 'failed';
  summaryJson: string | null;
  winner: string | null;
  personaBreakdownJson: string | null;
  airflowDagRunId: string | null;
  mlflowRunId: string | null;
  createdAt: string;
  finishedAt: string | null;
}

export interface SimStartRequest {
  nUsers?: number;
  nRounds?: number;
  tasksPerRound?: number;
  judgeMode?: 'rule' | 'llm';
  policies?: string[];
}

export function startSimulation(req: SimStartRequest) {
  return apiFetch<{ id: string; status: string; airflow_dag_run_id?: string }>(
    '/admin/simulate/start',
    { method: 'POST', body: JSON.stringify(req) },
  );
}

export function getSimulationRuns() {
  return apiFetch<{ runs: SimRun[] }>('/admin/simulate/runs');
}

export function getSimulationRun(id: string) {
  return apiFetch<{ run: SimRun & { isRunning: boolean }; events: unknown[] }>(
    `/admin/simulate/${id}`,
  );
}
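
`SimRun.summaryJson` arrives as a JSON string (or `null`), and the simulate page parses it with a bare `JSON.parse` plus a type cast. A more defensive parse might look like the sketch below. `parseSummary` is a hypothetical helper, not part of this diff; the per-policy shape (`total_reward`, `mean_reward`, `n_pulls`) is taken from the cast used in the page component.

```typescript
interface PolicySummary {
  total_reward: number;
  mean_reward: number;
  n_pulls: number;
}

// Parse summaryJson into a per-policy map, returning null instead of
// throwing when the string is absent, malformed, or has the wrong shape.
export function parseSummary(
  summaryJson: string | null,
): Record<string, PolicySummary> | null {
  if (!summaryJson) return null;
  let raw: unknown;
  try {
    raw = JSON.parse(summaryJson);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) return null;

  const out: Record<string, PolicySummary> = {};
  for (const [policy, value] of Object.entries(raw as Record<string, unknown>)) {
    const v = value as Partial<PolicySummary> | null;
    if (
      !v ||
      typeof v.total_reward !== "number" ||
      typeof v.mean_reward !== "number" ||
      typeof v.n_pulls !== "number"
    ) {
      return null;
    }
    out[policy] = { total_reward: v.total_reward, mean_reward: v.mean_reward, n_pulls: v.n_pulls };
  }
  return out;
}
```

Returning `null` keeps a single malformed row from throwing during render; the component can then skip the summary grid for that run.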

File diff suppressed because one or more lines are too long


@@ -1,7 +1,7 @@
# ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12) # ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)
**Status:** Accepted **Status:** Promoted
**Date:** 2026-04-25 **Date:** 2026-04-25 (accepted) / 2026-04-26 (promoted)
**Issue:** #99 **Issue:** #99
## Context ## Context
@@ -106,3 +106,19 @@ projecting theta without the corresponding `A` matrix cannot be done correctly.
the D=12 target in the issue spec and complicates the sim comparison. Deferred. the D=12 target in the issue spec and complicates the sim comparison. Deferred.
**In-place v1 promotion without shadow** — violates ADR-0002. **In-place v1 promotion without shadow** — violates ADR-0002.
## Promotion record (2026-04-26)
Offline sim (`runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42`):
| policy | total reward | mean reward | pulls |
|--------|-------------|-------------|-------|
| egreedy-v1 | 64.20 | 0.6420 | 100 |
| egreedy-v2 | 62.90 | 0.6290 | 100 |
**Gate passed** (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.
Changes applied:
- `recommender.ts` `remotePolicy()`: `/score/egreedy` → `/score/egreedy/v2`
- `recommender.ts` `sendRewardWithRetry()`: `/reward/egreedy` → `/reward/egreedy/v2`, added `profile_features` to payload
- Shadow entry `egreedy-v2-shadow` left in registry (`active: false`) for rollback.


@@ -1,21 +1,22 @@
FROM node:22-alpine AS base # syntax=docker/dockerfile:1.7
RUN npm install -g pnpm
FROM base AS deps FROM node:22-slim AS base
WORKDIR /app RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./ && rm -rf /var/lib/apt/lists/* \
COPY packages/shared-types/package.json ./packages/shared-types/ && npm install -g pnpm
COPY apps/admin/package.json ./apps/admin/ ENV CI=true \
RUN pnpm install --frozen-lockfile PNPM_HOME=/pnpm \
PATH=/pnpm:$PATH
RUN pnpm config set store-dir /pnpm/store
FROM base AS builder FROM base AS builder
WORKDIR /app WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules COPY pnpm-lock.yaml ./
COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
COPY --from=deps /app/apps/admin/node_modules ./apps/admin/node_modules COPY . .
COPY tsconfig.base.json ./ RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
COPY packages/shared-types ./packages/shared-types pnpm install --frozen-lockfile --offline \
COPY apps/admin ./apps/admin --filter @oo/admin... --filter @oo/shared-types
RUN pnpm --filter @oo/shared-types build RUN pnpm --filter @oo/shared-types build
ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow
ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow
@@ -24,7 +25,7 @@ ENV NEXT_TELEMETRY_DISABLED=1 \
NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL
RUN pnpm --filter @oo/admin build RUN pnpm --filter @oo/admin build
FROM node:22-alpine AS runner FROM node:22-slim AS runner
ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080 ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080
WORKDIR /app WORKDIR /app
COPY --from=builder /app/apps/admin/.next/standalone ./ COPY --from=builder /app/apps/admin/.next/standalone ./


@@ -1,32 +1,35 @@
FROM node:22-alpine AS base # syntax=docker/dockerfile:1.7
RUN npm install -g pnpm
FROM base AS deps FROM node:22-slim AS base
WORKDIR /app RUN apt-get update && apt-get install -y --no-install-recommends \
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./ python3 make g++ ca-certificates \
COPY packages/shared-types/package.json ./packages/shared-types/ && rm -rf /var/lib/apt/lists/* \
COPY services/api/package.json ./services/api/ && npm install -g pnpm
RUN pnpm install --frozen-lockfile ENV CI=true \
PNPM_HOME=/pnpm \
PATH=/pnpm:$PATH
RUN pnpm config set store-dir /pnpm/store
FROM base AS builder FROM base AS builder
WORKDIR /app WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules COPY pnpm-lock.yaml ./
COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
COPY --from=deps /app/services/api/node_modules ./services/api/node_modules COPY . .
COPY tsconfig.base.json ./ RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
COPY packages/shared-types ./packages/shared-types pnpm install --frozen-lockfile --offline \
COPY services/api ./services/api --filter @oo/api... --filter @oo/shared-types
RUN pnpm --filter @oo/shared-types build RUN pnpm --filter @oo/shared-types build
RUN pnpm --filter @oo/api build RUN pnpm --filter @oo/api build
RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
pnpm --filter @oo/api --prod deploy --legacy /deploy \
&& cp -r services/api/dist /deploy/dist \
&& rm -rf /deploy/node_modules/@oo/shared-types/src \
&& cp -r packages/shared-types/dist /deploy/node_modules/@oo/shared-types/dist
FROM node:22-alpine AS runner FROM node:22-slim AS runner
WORKDIR /app WORKDIR /app
RUN npm install -g pnpm ENV NODE_ENV=production
COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./ COPY --from=builder /deploy/package.json ./
COPY packages/shared-types/package.json ./packages/shared-types/ COPY --from=builder /deploy/node_modules ./node_modules
COPY services/api/package.json ./services/api/ COPY --from=builder /deploy/dist ./dist
RUN pnpm install --prod --frozen-lockfile
COPY --from=builder /app/packages/shared-types/dist ./packages/shared-types/dist
COPY --from=builder /app/services/api/dist ./services/api/dist
WORKDIR /app/services/api
CMD ["node", "dist/index.js"] CMD ["node", "dist/index.js"]


@@ -2,5 +2,5 @@ FROM python:3.12-slim
WORKDIR /app WORKDIR /app
COPY ml/serving/requirements.txt . COPY ml/serving/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt RUN pip install --no-cache-dir -r requirements.txt
COPY ml/serving/main.py . COPY ml/serving/*.py .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]


@@ -11,12 +11,18 @@ services:
env_file: ../../.env.local env_file: ../../.env.local
environment: environment:
NODE_ENV: production NODE_ENV: production
ML_SERVING_URL: "http://ml-serving:8000"
MLFLOW_URL: "http://mlflow:5000"
AIRFLOW_URL: "http://airflow-webserver:8080"
AIRFLOW_API_USER: "admin"
AIRFLOW_API_PASSWORD: "${AIRFLOW_ADMIN_PASSWORD:-admin}"
INTERNAL_API_TOKEN: "${INTERNAL_API_TOKEN:-}"
volumes: volumes:
- /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo - /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo
ports: ports:
- "127.0.0.1:3078:3078" - "127.0.0.1:3078:3078"
healthcheck: healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3078/health"] test: ["CMD", "node", "-e", "fetch('http://localhost:3078/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"]
interval: 10s interval: 10s
timeout: 5s timeout: 5s
retries: 5 retries: 5
@@ -49,6 +55,8 @@ services:
PORT: "3080" PORT: "3080"
HOSTNAME: "0.0.0.0" HOSTNAME: "0.0.0.0"
NEXT_PUBLIC_API_URL: "" NEXT_PUBLIC_API_URL: ""
NEXT_PUBLIC_MLFLOW_URL: "/mlflow"
NEXT_PUBLIC_AIRFLOW_URL: "/airflow"
INTERNAL_API_URL: "http://api:3078" INTERNAL_API_URL: "http://api:3078"
ports: ports:
- "127.0.0.1:3080:3080" - "127.0.0.1:3080:3080"
@@ -133,8 +141,14 @@ services:
AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod} AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-} AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow} AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth"
_PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
MLFLOW_TRACKING_USERNAME: "admin"
MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
volumes: volumes:
- ../../ml/pipelines:/opt/airflow/dags:ro - ../../ml/pipelines:/opt/airflow/dags:ro
- ../../ml:/opt/airflow/ml:ro
ports: ports:
- "127.0.0.1:8080:8080" - "127.0.0.1:8080:8080"
depends_on: depends_on:
@@ -155,8 +169,13 @@ services:
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
AIRFLOW__CORE__EXECUTOR: LocalExecutor AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-} AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
_PIP_ADDITIONAL_REQUIREMENTS: "mlflow==2.14.3 httpx"
MLFLOW_TRACKING_URI: "http://mlflow:5000/mlflow"
MLFLOW_TRACKING_USERNAME: "admin"
MLFLOW_TRACKING_PASSWORD: "${MLFLOW_ADMIN_PASSWORD:-password}"
volumes: volumes:
- ../../ml/pipelines:/opt/airflow/dags:ro - ../../ml/pipelines:/opt/airflow/dags:ro
- ../../ml:/opt/airflow/ml:ro
depends_on: depends_on:
airflow-init: airflow-init:
condition: service_completed_successfully condition: service_completed_successfully


@@ -4,8 +4,8 @@ Python. Owns models, features, training, online scoring.
| Dir | Role | Phase | | Dir | Role | Phase |
|---|---|---| |---|---|---|
| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`), called by `recommender` | 12 | | `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`) + JetStream consumers for `signals.>` / `feedback.>`, called by `recommender` | 12 |
| `features/` | context assembler (`context.py`): signals → `PromptContext`; Feast adapter later | 2 | | `features/` | context assembler (`context.py`): signals → `PromptContext`; profile-feature schema mirror (`profile_schema.py`); Feast adapter later | 2 |
| `pipelines/` | batch feature + training DAGs (Prefect/Airflow) | 4 | | `pipelines/` | batch feature + training DAGs (Prefect/Airflow) | 4 |
| `registry/` | MLflow-backed model registry integration | 4 | | `registry/` | MLflow-backed model registry integration | 4 |
| `experiments/` | A/B assignment + multi-armed bandit policies | 4 | | `experiments/` | A/B assignment + multi-armed bandit policies | 4 |
@@ -18,14 +18,24 @@ Python. Owns models, features, training, online scoring.
- Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew). - Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew).
- Shadow deploys before any policy change that affects real users. - Shadow deploys before any policy change that affects real users.
## Profile-feature contract ## Feature contract
### Profile features (batched)
User-level features (completion rate, preferred hour, tip volume…) are computed User-level features (completion rate, preferred hour, tip volume…) are computed
by the TypeScript recommender and shipped to ml/serving on every `/score` and by the TypeScript recommender and shipped to `ml/serving` on every `/score` and
`/generate` call as `profile_features: dict | None`. The Python mirror in `/generate` call as `profile_features: dict | None`. The Python mirror in
`features/profile_schema.py` documents the available names + dtypes — keep it `features/profile_schema.py` documents each feature's name, dtype, TTL, source,
in sync with `services/api/src/profile/registry.ts` (a CI-style test asserts and null fallback — keep it in sync with `services/api/src/profile/registry.ts`
the name sets match). See ADR-0011. (a CI-style test asserts names and `ttlSec` values match). See ADR-0011.
### Context features (JIT)
Request-time signals assembled by `features/context.py` (`hour_of_day`,
`day_of_week`, task list). These are never cached — they are derived from the
system clock and the live Todoist feed at the moment of the score call.
`CONTEXT_FEATURES` in `context.py` declares freshness, source, and fallback for
each field (issue #61).
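
A minimal sketch of how a consumer might apply declared fallbacks when a request-time field is missing. `FeatureSpec`, `SPECS`, and `resolve` are hypothetical stand-ins written for illustration, not the actual `ContextFeatureSpec` API, and parsing the fallback strings with `json.loads` is an assumption about how they could be interpreted.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:  # hypothetical stand-in for ContextFeatureSpec
    name: str
    fallback: str  # fallbacks are declared as strings in the spec table


# Names and fallbacks copied from the CONTEXT_FEATURES entries in context.py.
SPECS = (
    FeatureSpec("hour_of_day", "12"),
    FeatureSpec("day_of_week", "0"),
    FeatureSpec("tasks", "[]"),
)


def resolve(request: dict) -> dict:
    """Return request values, substituting the declared fallback for missing fields."""
    out = {}
    for spec in SPECS:
        value = request.get(spec.name)
        # Assumption: a fallback string is a JSON literal ("12", "0", "[]").
        out[spec.name] = json.loads(spec.fallback) if value is None else value
    return out
```

Calling `resolve({"hour_of_day": 9})` keeps the supplied hour and fills `day_of_week` and `tasks` from the declared fallbacks.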
## Prompt registry ## Prompt registry


@@ -26,6 +26,7 @@ from __future__ import annotations
import argparse import argparse
import json import json
import os
import random import random
import sys import sys
import time import time
@@ -40,6 +41,12 @@ from llm_judge import ACTIONS, infer_reward, judge
from personas import PERSONAS, Persona from personas import PERSONAS, Persona
from task_generator import generate_task_pool from task_generator import generate_task_pool
try:
import mlflow
_MLFLOW_AVAILABLE = True
except ImportError:
_MLFLOW_AVAILABLE = False
POLICY_SCORE_ENDPOINTS: dict[str, str] = { POLICY_SCORE_ENDPOINTS: dict[str, str] = {
"linucb-v1": "/score", "linucb-v1": "/score",
"egreedy-v1": "/score/egreedy", "egreedy-v1": "/score/egreedy",
@@ -107,14 +114,30 @@ def _call_reward(
# ── Standard single-pass runner (rule / llm modes) ───────────────────────── # ── Standard single-pass runner (rule / llm modes) ─────────────────────────
def _init_mlflow(mlflow_url: str | None, experiment: str) -> str | None:
"""Configure MLflow tracking; return "ready" on success, or None if MLflow is unavailable or setup fails."""
if not _MLFLOW_AVAILABLE or not mlflow_url:
return None
try:
mlflow.set_tracking_uri(mlflow_url)
mlflow.set_experiment(experiment)
return "ready"
except Exception as e:
print(f" [warn] MLflow init failed: {e}", file=sys.stderr)
return None
def run_simulation( def run_simulation(
n_users: int, n_rounds: int, tasks_per_round: int, n_users: int, n_rounds: int, tasks_per_round: int,
ml_url: str, policies: list[str], use_llm: bool, seed: int, ml_url: str, policies: list[str], use_llm: bool, seed: int,
mlflow_url: str | None = None, mlflow_experiment: str = "bandit_simulation",
) -> dict: ) -> dict:
rng = random.Random(seed) rng = random.Random(seed)
run_id = str(uuid.uuid4())[:8] run_id = str(uuid.uuid4())[:8]
started_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()) started_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
_init_mlflow(mlflow_url, mlflow_experiment)
user_personas = [ user_personas = [
(f"sim-{run_id}-u{i}", PERSONAS[i % len(PERSONAS)]) (f"sim-{run_id}-u{i}", PERSONAS[i % len(PERSONAS)])
for i in range(n_users) for i in range(n_users)
@@ -130,6 +153,26 @@ def run_simulation(
} }
events: list[dict] = [] events: list[dict] = []
mlflow_run_id: str | None = None
mlflow_ctx = (
mlflow.start_run(run_name=run_id)
if (_MLFLOW_AVAILABLE and mlflow_url)
else None
)
try:
if mlflow_ctx:
active = mlflow_ctx.__enter__()
mlflow_run_id = active.info.run_id
mlflow.log_params({
"n_users": n_users,
"n_rounds": n_rounds,
"tasks_per_round": tasks_per_round,
"policies": ",".join(policies),
"judge": "llm" if use_llm else "rule",
"seed": seed,
})
with httpx.Client(trust_env=False) as client: with httpx.Client(trust_env=False) as client:
for rnd in range(n_rounds): for rnd in range(n_rounds):
hour = rng.randint(6, 22) hour = rng.randint(6, 22)
@@ -139,8 +182,6 @@ def run_simulation(
for user_id, persona in user_personas: for user_id, persona in user_personas:
seed_tasks = rnd * 997 + abs(hash(user_id)) % 997 seed_tasks = rnd * 997 + abs(hash(user_id)) % 997
tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks) tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks)
# Per-persona profile features for v2 (synthetic for sim — see ADR-0012)
profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None
for policy in policies: for policy in policies:
@@ -179,13 +220,34 @@ def run_simulation(
prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0 prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0
acc[p]["cumulative_rewards"].append(prev + round_rewards[p]) acc[p]["cumulative_rewards"].append(prev + round_rewards[p])
if mlflow_ctx:
for p in policies:
mlflow.log_metric(f"{p}_cumulative_reward",
acc[p]["cumulative_rewards"][-1], step=rnd)
mode = "llm" if use_llm else "rule" mode = "llm" if use_llm else "rule"
print(f" Round {rnd+1:>3}/{n_rounds} [{mode}] " + " ".join( print(f" Round {rnd+1:>3}/{n_rounds} [{mode}] " + " ".join(
f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies
)) ))
return _build_result(run_id, started_at, policies, acc, events, result = _build_result(run_id, started_at, policies, acc, events,
n_users, n_rounds, tasks_per_round, use_llm, seed) n_users, n_rounds, tasks_per_round, use_llm, seed)
result["mlflow_run_id"] = mlflow_run_id
if mlflow_ctx:
for p, s in result["summary"].items():
mlflow.log_metrics({
f"{p}_total_reward": s["total_reward"],
f"{p}_mean_reward": s["mean_reward"],
f"{p}_n_pulls": s["n_pulls"],
})
mlflow.set_tag("winner", result["winner"])
return result
finally:
if mlflow_ctx:
mlflow_ctx.__exit__(None, None, None)
# ── Claude Code judge — phase 1: score ───────────────────────────────────── # ── Claude Code judge — phase 1: score ─────────────────────────────────────
@@ -494,6 +556,9 @@ if __name__ == "__main__":
help="Alias for --judge rule (backwards compat)") help="Alias for --judge rule (backwards compat)")
parser.add_argument("--seed", type=int, default=42) parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--out", default=None) parser.add_argument("--out", default=None)
parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI"),
help="MLflow tracking URI (e.g. http://mlflow:5000/mlflow)")
parser.add_argument("--mlflow-experiment", default="bandit_simulation")
args = parser.parse_args() args = parser.parse_args()
if args.no_llm: if args.no_llm:
@@ -534,6 +599,7 @@ if __name__ == "__main__":
n_users=args.n_users, n_rounds=args.n_rounds, n_users=args.n_users, n_rounds=args.n_rounds,
tasks_per_round=args.tasks_per_round, ml_url=args.ml_url, tasks_per_round=args.tasks_per_round, ml_url=args.ml_url,
policies=args.policies, use_llm=use_llm, seed=args.seed, policies=args.policies, use_llm=use_llm, seed=args.seed,
mlflow_url=args.mlflow_url, mlflow_experiment=args.mlflow_experiment,
) )
Path(out_path).write_text(json.dumps(result, indent=2)) Path(out_path).write_text(json.dumps(result, indent=2))
print() print()


@@ -1,3 +1,8 @@
from .context import build_context, PromptContext, TaskSignal from .context import build_context, PromptContext, TaskSignal, ContextFeatureSpec, CONTEXT_FEATURES
from .profile_schema import ProfileFeature, PROFILE_FEATURES, feature_names
__all__ = ["build_context", "PromptContext", "TaskSignal"] __all__ = [
"build_context", "PromptContext", "TaskSignal",
"ContextFeatureSpec", "CONTEXT_FEATURES",
"ProfileFeature", "PROFILE_FEATURES", "feature_names",
]


@@ -2,12 +2,56 @@
Context assembler — converts raw user signals into a PromptContext for LLM tip generation. Context assembler — converts raw user signals into a PromptContext for LLM tip generation.
Usage: Usage:
from ml.features.context import build_context from ml.features.context import build_context, CONTEXT_FEATURES
ctx = build_context(tasks, hour_of_day=9, day_of_week=2) ctx = build_context(tasks, hour_of_day=9, day_of_week=2)
Feature-spec (issue #61):
All context features are JIT — they are assembled at request time from live
sources (system clock, caller-supplied task list) rather than read from a
cached profile store. They carry no TTL because they are never persisted.
""" """
from __future__ import annotations from __future__ import annotations
from dataclasses import dataclass, field from dataclasses import dataclass, field
from typing import Literal
@dataclass(frozen=True)
class ContextFeatureSpec:
name: str
dtype: Literal["numeric", "categorical", "list"]
freshness: Literal["jit", "batched"]
source: str
fallback: str
description: str
CONTEXT_FEATURES: tuple[ContextFeatureSpec, ...] = (
ContextFeatureSpec(
name="hour_of_day",
dtype="numeric",
freshness="jit",
source="request",
fallback="12",
description="Current hour (0–23), supplied by the caller at score time.",
),
ContextFeatureSpec(
name="day_of_week",
dtype="numeric",
freshness="jit",
source="request",
fallback="0",
description="ISO weekday (0=Monday … 6=Sunday), supplied by the caller at score time.",
),
ContextFeatureSpec(
name="tasks",
dtype="list",
freshness="jit",
source="todoist-integration",
fallback="[]",
description="User's open tasks fetched live from the Todoist integration at request time.",
),
)
@dataclass @dataclass


@@ -8,6 +8,12 @@ code (ml/serving, eval harnesses, notebooks) knows what fields to expect on
Update this file whenever you add or rename a feature in the TS registry. Update this file whenever you add or rename a feature in the TS registry.
The accompanying test asserts the two stay in sync at the name level. The accompanying test asserts the two stay in sync at the name level.
Feature-spec fields (issue #61):
freshness — "batched": value cached in profile store, recomputed on TTL/event.
ttl_sec — cache lifetime in seconds; mirrors ``ttlSec`` in registry.ts.
source — where the value originates.
fallback — raw value returned when the feature is unavailable (null stored).
""" """
from __future__ import annotations from __future__ import annotations
@@ -16,6 +22,10 @@ from typing import Literal
Dtype = Literal["numeric", "categorical"] Dtype = Literal["numeric", "categorical"]
Freshness = Literal["jit", "batched"]
_HOUR = 3600
_DAY = 86_400
@dataclass(frozen=True) @dataclass(frozen=True)
@@ -23,28 +33,57 @@ class ProfileFeature:
name: str name: str
dtype: Dtype dtype: Dtype
description: str description: str
freshness: Freshness
ttl_sec: int
source: str
fallback: str
PROFILE_FEATURES: tuple[ProfileFeature, ...] = ( PROFILE_FEATURES: tuple[ProfileFeature, ...] = (
ProfileFeature( ProfileFeature(
"completion_rate_30d", "numeric", name="completion_rate_30d",
'Fraction of tips served in the last 30 days that received a "done" reaction.', dtype="numeric",
description='Fraction of tips served in the last 30 days that received a "done" reaction.',
freshness="batched",
ttl_sec=6 * _HOUR,
source="profile_store",
fallback="0.0",
), ),
ProfileFeature( ProfileFeature(
"dismiss_rate_30d", "numeric", name="dismiss_rate_30d",
'Fraction of tips served in the last 30 days that received a "dismiss" reaction.', dtype="numeric",
description='Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
freshness="batched",
ttl_sec=6 * _HOUR,
source="profile_store",
fallback="0.0",
), ),
ProfileFeature( ProfileFeature(
"mean_dwell_ms_30d", "numeric", name="mean_dwell_ms_30d",
"Average dwell time (ms between served and reacted) over the last 30 days.", dtype="numeric",
description="Average dwell time (ms between served and reacted) over the last 30 days.",
freshness="batched",
ttl_sec=6 * _HOUR,
source="profile_store",
fallback="null — serving normalises to 0.0",
), ),
ProfileFeature( ProfileFeature(
"preferred_hour", "numeric", name="preferred_hour",
'Hour-of-day with the most "done" reactions in the last 30 days (0-23).', dtype="numeric",
description='Hour-of-day with the most "done" reactions in the last 30 days (0–23).',
freshness="batched",
ttl_sec=_DAY,
source="profile_store",
fallback="null — serving normalises to 0.5 (neutral alignment)",
), ),
ProfileFeature( ProfileFeature(
"tip_volume_30d", "numeric", name="tip_volume_30d",
"Number of tips served to the user in the last 30 days.", dtype="numeric",
description="Number of tips served to the user in the last 30 days.",
freshness="batched",
ttl_sec=_HOUR,
source="profile_store",
fallback="0",
), ),
) )


@@ -1,7 +1,7 @@
"""Tests for ml/features/context.py""" """Tests for ml/features/context.py"""
import pytest import pytest
import sys, os; sys.path.insert(0, os.path.dirname(__file__)) import sys, os; sys.path.insert(0, os.path.dirname(__file__))
from context import build_context, TaskSignal, PromptContext from context import build_context, TaskSignal, PromptContext, CONTEXT_FEATURES
def test_empty_tasks(): def test_empty_tasks():
@@ -62,3 +62,30 @@ def test_due_date_none_preserved():
tasks = [TaskSignal(id="x", content="No due", due_date=None)] tasks = [TaskSignal(id="x", content="No due", due_date=None)]
ctx = build_context(tasks) ctx = build_context(tasks)
assert ctx.tasks[0]["due_date"] is None assert ctx.tasks[0]["due_date"] is None
# ── CONTEXT_FEATURES spec tests (issue #61) ──────────────────────────────────
def test_context_features_expected_names():
names = {f.name for f in CONTEXT_FEATURES}
assert names == {"hour_of_day", "day_of_week", "tasks"}
def test_context_features_all_jit():
for f in CONTEXT_FEATURES:
assert f.freshness == "jit", f"{f.name}: expected freshness='jit', got {f.freshness!r}"
def test_context_features_source_set():
for f in CONTEXT_FEATURES:
assert f.source, f"{f.name}: source must not be empty"
def test_context_features_fallback_set():
for f in CONTEXT_FEATURES:
assert f.fallback, f"{f.name}: fallback must not be empty"
def test_context_features_no_duplicates():
names = [f.name for f in CONTEXT_FEATURES]
assert len(names) == len(set(names)), f"duplicate names: {names}"


@@ -1,4 +1,4 @@
"""Smoke test for profile_schema mirror (#81 phase A). """Smoke test for profile_schema mirror (#81 phase A, #61 freshness spec).
The TS registry in services/api/src/profile/registry.ts is the source of truth. The TS registry in services/api/src/profile/registry.ts is the source of truth.
This test checks the names listed here match the registry by reading the TS This test checks the names listed here match the registry by reading the TS
@@ -14,6 +14,18 @@ from ml.features.profile_schema import PROFILE_FEATURES, feature_names
REGISTRY_PATH = Path(__file__).resolve().parents[2] / "services" / "api" / "src" / "profile" / "registry.ts" REGISTRY_PATH = Path(__file__).resolve().parents[2] / "services" / "api" / "src" / "profile" / "registry.ts"
_HOUR = 3600
_DAY = 86_400
# Expected ttl_sec values mirrored from registry.ts — keeps the two in sync.
_EXPECTED_TTL: dict[str, int] = {
"completion_rate_30d": 6 * _HOUR,
"dismiss_rate_30d": 6 * _HOUR,
"mean_dwell_ms_30d": 6 * _HOUR,
"preferred_hour": _DAY,
"tip_volume_30d": _HOUR,
}
def _ts_registry_names() -> set[str]: def _ts_registry_names() -> set[str]:
text = REGISTRY_PATH.read_text(encoding="utf-8") text = REGISTRY_PATH.read_text(encoding="utf-8")
@@ -21,6 +33,35 @@ def _ts_registry_names() -> set[str]:
return set(re.findall(r"name:\s*'([a-zA-Z0-9_]+)'", text)) return set(re.findall(r"name:\s*'([a-zA-Z0-9_]+)'", text))
def _ts_registry_ttls() -> dict[str, int]:
"""Parse ttlSec values from registry.ts (crude but sufficient for drift detection).
Handles TS symbolic constants (HOUR, DAY) and expressions like ``6 * HOUR``.
"""
text = REGISTRY_PATH.read_text(encoding="utf-8")
# Extract numeric constants: `const HOUR = 3600;` or `const DAY = 86_400;`
consts: dict[str, int] = {}
for m in re.finditer(r"const\s+([A-Z_]+)\s*=\s*([\d_]+)", text):
consts[m.group(1)] = int(m.group(2).replace("_", ""))
def _eval_expr(expr: str) -> int:
tokens = [t.strip() for t in expr.split("*")]
result = 1
for t in tokens:
result *= consts[t] if t in consts else int(t)
return result
result: dict[str, int] = {}
for block in re.split(r"\{", text):
name_m = re.search(r"name:\s*'([a-zA-Z0-9_]+)'", block)
# ttlSec may be a constant name, a number, or `N * CONST`
ttl_m = re.search(r"ttlSec:\s*([A-Za-z0-9_]+(?:\s*\*\s*[A-Za-z0-9_]+)?)", block)
if name_m and ttl_m:
result[name_m.group(1)] = _eval_expr(ttl_m.group(1))
return result
def test_python_mirror_matches_ts_registry(): def test_python_mirror_matches_ts_registry():
py_names = feature_names() py_names = feature_names()
ts_names = _ts_registry_names() ts_names = _ts_registry_names()
@@ -39,3 +80,34 @@ def test_profile_schema_no_duplicates():
def test_profile_schema_dtypes_known(): def test_profile_schema_dtypes_known():
for f in PROFILE_FEATURES: for f in PROFILE_FEATURES:
assert f.dtype in {"numeric", "categorical"} assert f.dtype in {"numeric", "categorical"}
def test_all_profile_features_are_batched():
for f in PROFILE_FEATURES:
assert f.freshness == "batched", f"{f.name}: expected freshness='batched', got {f.freshness!r}"
def test_profile_feature_ttl_matches_ts_registry():
ts_ttls = _ts_registry_ttls()
for f in PROFILE_FEATURES:
assert f.name in ts_ttls, f"{f.name} not found in TS registry ttlSec parse"
assert f.ttl_sec == ts_ttls[f.name], (
f"{f.name}: Python ttl_sec={f.ttl_sec} != TS ttlSec={ts_ttls[f.name]}"
)
def test_profile_feature_ttl_matches_expected():
for f in PROFILE_FEATURES:
assert f.ttl_sec == _EXPECTED_TTL[f.name], (
f"{f.name}: ttl_sec={f.ttl_sec}, expected {_EXPECTED_TTL[f.name]}"
)
def test_profile_feature_source_is_profile_store():
for f in PROFILE_FEATURES:
assert f.source == "profile_store", f"{f.name}: unexpected source {f.source!r}"
def test_profile_feature_fallback_set():
for f in PROFILE_FEATURES:
assert f.fallback, f"{f.name}: fallback must not be empty"

ml/pipelines/sim_dag.py Normal file

@@ -0,0 +1,124 @@
"""
Airflow DAG: bandit_sim
Runs a bandit policy simulation and logs results to MLflow.
Triggered on-demand from the oO admin panel or manually from the Airflow UI.
Required conf keys (passed via dag_run.conf):
sim_run_id str — oO SQLite run ID for callback correlation
n_users int — number of synthetic users
n_rounds int — rounds per user
tasks_per_round int — candidate pool size per round
policies list — policy names to compare
judge_mode str — "rule" | "llm"
ml_url str — ml/serving URL (e.g. http://ml-serving:8000)
mlflow_url str — MLflow tracking URI (e.g. http://mlflow:5000/mlflow)
callback_url str — oO API callback endpoint
internal_token str — x-internal-token header value
"""
from __future__ import annotations

import json
import os
import sys
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def _run_sim(**context: object) -> dict:
    conf: dict = context["dag_run"].conf or {}
    n_users = int(conf.get("n_users", 5))
    n_rounds = int(conf.get("n_rounds", 20))
    tasks_per_round = int(conf.get("tasks_per_round", 8))
    policies = list(conf.get("policies", ["linucb-v1", "egreedy-v1"]))
    judge_mode = str(conf.get("judge_mode", "rule"))
    ml_url = str(conf.get("ml_url", "http://ml-serving:8000"))
    mlflow_url = str(conf.get("mlflow_url", os.environ.get("MLFLOW_TRACKING_URI", "")))
    mlflow_experiment = "bandit_simulation"

    sys.path.insert(0, "/opt/airflow/ml/experiments/sim")
    from runner import run_simulation  # type: ignore[import]

    use_llm = judge_mode == "llm"
    result = run_simulation(
        n_users=n_users,
        n_rounds=n_rounds,
        tasks_per_round=tasks_per_round,
        ml_url=ml_url,
        policies=policies,
        use_llm=use_llm,
        seed=42,
        mlflow_url=mlflow_url or None,
        mlflow_experiment=mlflow_experiment,
    )
    return result


def _callback(**context: object) -> None:
    import httpx

    conf: dict = context["dag_run"].conf or {}
    callback_url: str = str(conf.get("callback_url", ""))
    internal_token: str = str(conf.get("internal_token", ""))
    if not callback_url or not internal_token:
        print("No callback_url or internal_token — skipping result push.", flush=True)
        return

    result: dict = context["ti"].xcom_pull(task_ids="run_sim")
    if not result:
        print("No result from run_sim task — callback skipped.", flush=True)
        return

    payload = {
        "summary": result.get("summary", {}),
        "winner": result.get("winner", ""),
        "persona_breakdown": result.get("persona_breakdown", {}),
        "events": result.get("events", []),
        "mlflow_run_id": result.get("mlflow_run_id"),
    }
    try:
        r = httpx.post(
            callback_url,
            json=payload,
            headers={"x-internal-token": internal_token},
            timeout=30.0,
        )
        r.raise_for_status()
        print(f"Callback OK: {r.status_code}", flush=True)
    except Exception as exc:
        print(f"Callback failed: {exc}", flush=True)
        raise


with DAG(
    dag_id="bandit_sim",
    description="On-demand bandit policy simulation with MLflow tracking",
    schedule_interval=None,
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["bandit", "simulation", "ml"],
    default_args={
        "retries": 1,
        "retry_delay": timedelta(minutes=2),
    },
) as dag:
    run_sim = PythonOperator(
        task_id="run_sim",
        python_callable=_run_sim,
        provide_context=True,
    )
    push_results = PythonOperator(
        task_id="push_results",
        python_callable=_callback,
        provide_context=True,
    )
    run_sim >> push_results

104
ml/serving/README.md Normal file

@@ -0,0 +1,104 @@
# ml/serving
FastAPI online scorer, tip generator, and JetStream consumer.
## Contract
| Endpoint | Description |
|----------|-------------|
| `POST /score` | LinUCB d=5 (baseline, shadow-eligible) |
| `POST /score/egreedy` | ε-greedy v1, d=7 (active policy — ADR-0007) |
| `POST /score/egreedy/v2` | ε-greedy v2, d=12 + profile features (shadow — ADR-0012) |
| `POST /reward` / `/reward/egreedy` / `/reward/egreedy/v2` | Online reward update per policy |
| `POST /generate` | LLM tip candidates via LiteLLM `tip-generator` alias |
| `GET /stats/{user_id}` / `/stats/egreedy/{user_id}` / `/stats/egreedy/v2/{user_id}` | Per-user policy stats |
| `GET /features/{user_id}` | Last 100 scored feature vectors (ring buffer) |
| `POST /reset/{user_id}` | Clear all per-user bandit state (admin) |
| `GET /health` | `{ ok, nats: { enabled, consumers: { signals, feedback } } }` |
Called by `services/api/src/recommender/` over HTTP. Contract is stable across policy swaps.
## Feature dimensions
| Policy | d | Extra dims vs previous |
|--------|---|------------------------|
| LinUCB v1 | 5 | hour_sin/cos, is_overdue, task_age, priority |
| ε-greedy v1 | 7 | + dow_sin/cos |
| ε-greedy v2 | 12 | + 5 profile features (ADR-0012) |
Profile features are computed by the TypeScript API and shipped on each `/score` call as `profile_features`. See `ml/README.md` and ADR-0011.
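
The `hour_sin/cos` and `dow_sin/cos` dimensions in the table suggest a cyclical encoding of time. A minimal sketch, assuming the usual sine/cosine encoding; the function name `time_features` and the exact scaling are hypothetical and not taken from `ml/serving`:

```python
import math
from datetime import datetime


def time_features(ts: datetime) -> list[float]:
    """Encode hour-of-day and day-of-week on the unit circle (hypothetical helper)."""
    hour_angle = 2 * math.pi * ts.hour / 24
    dow_angle = 2 * math.pi * ts.weekday() / 7  # Monday == 0
    return [
        math.sin(hour_angle), math.cos(hour_angle),  # hour_sin, hour_cos
        math.sin(dow_angle), math.cos(dow_angle),    # dow_sin, dow_cos
    ]
```

Encoding on the circle keeps 23:00 and 00:00 close together, which a raw hour value would not.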
## JetStream consumers
On startup, `nats_consumer.py` registers two durable push consumers against NATS JetStream:
| Consumer | Stream | Subjects | Durable name |
|----------|--------|----------|--------------|
| signals | `signals` | `signals.>` | `feature-pipeline-signals` |
| feedback | `feedback` | `feedback.>` | `feature-pipeline-feedback` |
**Handled subjects:**
- `signals.task.synced` — writes `{last_sync_ts, task_count}` to `{STATE_DIR}/{user}_sync.json`
- `signals.tip.feedback` — logged for observability; reward update happens via the HTTP path in the recommender
**Payload validation:** each message is validated against the pydantic models in `schemas.py` (mirroring `packages/shared-types/events/oo/events/v1/`). A `ValidationError` triggers a nak so the message is redelivered rather than silently dropped.
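**Ack semantics:** explicit ack on success; nak for redelivery on error. After `NATS_MAX_DELIVER` delivery attempts the message is no longer redelivered (it is effectively dropped, matching the `NATS_MAX_DELIVER` description under Config).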
**Disabled** when `NATS_URL` is unset (default in local dev without NATS). No import of `nats-py` occurs in that case.
## Observability
Logs are structured JSON via **structlog**. Every line includes `level`, `logger`, `timestamp`, and — when a W3C `traceparent` header is present on the incoming request — `trace_id` bound via Python `contextvars`, so all log lines within a request carry the same trace ID as the upstream API call.
Sentry error capture is active when `SENTRY_DSN` is set.
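
The trace-ID binding described above depends on pulling the 32-hex-character trace ID out of a W3C `traceparent` header (`version-traceid-parentid-flags`). A minimal sketch of that extraction; `extract_trace_id` is a hypothetical helper name, and the hex and all-zero checks are extra assumptions beyond a simple length test:

```python
from typing import Optional

_HEX = set("0123456789abcdef")


def extract_trace_id(traceparent: str) -> Optional[str]:
    """Return the trace-id field of a W3C traceparent header, or None if malformed."""
    parts = traceparent.strip().split("-")
    if len(parts) != 4:
        return None
    trace_id = parts[1]
    if len(trace_id) != 32 or not set(trace_id) <= _HEX:
        return None
    if trace_id == "0" * 32:  # an all-zero trace ID is invalid in the W3C spec
        return None
    return trace_id
```

Returning `None` for malformed headers lets the caller skip the binding instead of logging a bogus trace ID.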
## Config
| Env var | Default | Description |
|---------|---------|-------------|
| `STATE_DIR` | `/tmp/oo-bandit-state` | Directory for per-user bandit state JSON files |
| `LITELLM_URL` | `http://localhost:4000` | LiteLLM gateway |
| `LITELLM_MASTER_KEY` | `sk-oo-dev` | LiteLLM auth key |
| `NATS_URL` | `` | NATS broker URL; empty = consumers disabled |
| `NATS_DURABLE_PREFIX` | `feature-pipeline` | Prefix for durable consumer names |
| `NATS_MAX_DELIVER` | `5` | Max redelivery attempts before dropping |
| `DEFAULT_PROMPT_VERSION` | `v1` | Fallback prompt version for `/generate` |
| `ENV` | `development` | Environment label (passed to Sentry) |
| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
## Health story
`GET /health` returns `{ ok: true }` plus NATS consumer state:
```json
{
"ok": true,
"nats": {
"enabled": true,
"consumers": {
"signals": { "last_msg_ts": "2026-04-25T10:00:00Z", "processed": 42, "errors": 0 },
"feedback": { "last_msg_ts": null, "processed": 0, "errors": 0 }
}
}
}
```
`last_msg_ts` is `null` until the first message arrives. This endpoint is used by the docker-compose healthcheck.
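
A consumer of this payload, such as a healthcheck script, might treat the service as healthy only when `ok` is true and, if NATS is enabled, both consumers are present with an acceptable error count. A minimal sketch over the JSON shape shown above; `is_healthy` and the `max_errors` threshold are hypothetical and not part of `ml/serving`:

```python
import json


def is_healthy(body: str, max_errors: int = 0) -> bool:
    """Interpret a /health response body (hypothetical policy, not the service's own)."""
    data = json.loads(body)
    if not data.get("ok"):
        return False
    nats = data.get("nats", {})
    if not nats.get("enabled"):
        return True  # consumers disabled: nothing more to check
    consumers = nats.get("consumers", {})
    return all(
        name in consumers and consumers[name].get("errors", 0) <= max_errors
        for name in ("signals", "feedback")
    )
```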
## Extraction criteria
`ml/serving` already runs as its own process. Extract it to a dedicated host / GPU node when:
- p99 scoring latency exceeds 50 ms under load, **or**
- model weights are too large to share memory with the Python process on the current host.
## State
Per-user bandit state is stored as JSON files in `STATE_DIR`:
| File pattern | Policy |
|---|---|
| `{user}.json` | LinUCB v1 |
| `{user}_egreedy.json` | ε-greedy v1 |
| `{user}_egreedy_v2.json` | ε-greedy v2 |
| `{user}_sync.json` | Last task sync metadata (written by JetStream consumer) |
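
The file patterns above can be expressed as a small lookup. A minimal sketch; `state_paths` and its dictionary keys are hypothetical, and the sanitisation of `user` (non-alphanumerics replaced with `_`) follows what `nats_consumer.py` does for `_sync.json` and is only assumed for the other files:

```python
from pathlib import Path

# Suffix per state file; keys are illustrative labels, not identifiers used by the service.
_SUFFIXES = {
    "linucb-v1": "",
    "egreedy-v1": "_egreedy",
    "egreedy-v2": "_egreedy_v2",
    "sync": "_sync",
}


def state_paths(state_dir: Path, user_id: str) -> dict[str, Path]:
    """Map each state kind to its JSON file for one user (hypothetical helper)."""
    safe = "".join(c if c.isalnum() else "_" for c in user_id)
    return {key: state_dir / f"{safe}{suffix}.json" for key, suffix in _SUFFIXES.items()}
```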


@@ -0,0 +1,20 @@
"""Structlog JSON configuration — import once at process start."""
import logging

import structlog


def configure() -> None:
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.stdlib.add_log_level,
            # Caution: add_logger_name reads `logger.name`, which PrintLogger does not
            # define; verify this processor works with PrintLoggerFactory below.
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(),
    )
    logging.basicConfig(level=logging.WARNING)


@@ -28,17 +28,55 @@ import math
import os import os
import time import time
from collections import deque from collections import deque
from contextlib import asynccontextmanager
from pathlib import Path from pathlib import Path
from typing import Optional, Deque from typing import Optional, Deque
import httpx import httpx
import numpy as np import numpy as np
from fastapi import FastAPI, HTTPException import sentry_sdk
import structlog
import structlog.contextvars
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel from pydantic import BaseModel
from starlette.middleware.base import BaseHTTPMiddleware
import logging_config
import nats_consumer
from prompts import get_prompt from prompts import get_prompt
app = FastAPI(title="oO ML Serving", version="1.0.0") logging_config.configure()
_SENTRY_DSN = os.getenv("SENTRY_DSN")
if _SENTRY_DSN:
sentry_sdk.init(dsn=_SENTRY_DSN, environment=os.getenv("ENV", "development"))
log = structlog.get_logger()
@asynccontextmanager
async def lifespan(app: FastAPI):
await nats_consumer.start(STATE_DIR)
yield
await nats_consumer.stop()
app = FastAPI(title="oO ML Serving", version="1.0.0", lifespan=lifespan)
class _TracingMiddleware(BaseHTTPMiddleware):
async def dispatch(self, request: Request, call_next):
structlog.contextvars.clear_contextvars()
traceparent = request.headers.get("traceparent", "")
if traceparent:
parts = traceparent.split("-")
trace_id = parts[1] if len(parts) == 4 and len(parts[1]) == 32 else None
if trace_id:
structlog.contextvars.bind_contextvars(trace_id=trace_id)
return await call_next(request)
app.add_middleware(_TracingMiddleware)
LITELLM_URL = os.getenv("LITELLM_URL", "http://localhost:4000") LITELLM_URL = os.getenv("LITELLM_URL", "http://localhost:4000")
LITELLM_MASTER_KEY = os.getenv("LITELLM_MASTER_KEY", "sk-oo-dev") LITELLM_MASTER_KEY = os.getenv("LITELLM_MASTER_KEY", "sk-oo-dev")
@@ -315,7 +353,13 @@ class GenerateResponse(BaseModel):
@app.get("/health") @app.get("/health")
def health(): def health():
return {"ok": True} return {
"ok": True,
"nats": {
"enabled": bool(nats_consumer.NATS_URL),
"consumers": nats_consumer.consumer_health,
},
}
_RETRY_SUFFIX = ( _RETRY_SUFFIX = (

146
ml/serving/nats_consumer.py Normal file

@@ -0,0 +1,146 @@
"""
JetStream durable consumers for ml/serving.

Streams:
  signals   (subjects: signals.>)  — durable: {prefix}-signals
  feedback  (subjects: feedback.>) — durable: {prefix}-feedback

Handled subjects:
  signals.task.synced   → write per-user sync metadata to STATE_DIR
  signals.tip.feedback  → log for observability (reward is applied via HTTP path)

Config (env vars):
  NATS_URL             — broker URL; empty = consumers disabled (default: "")
  NATS_DURABLE_PREFIX  — prefix for durable consumer names (default: "feature-pipeline")
  NATS_MAX_DELIVER     — max redelivery attempts before dropping (default: 5)
"""
from __future__ import annotations

import json
import os
import time
from pathlib import Path
from typing import Optional

import structlog

from schemas import TaskSyncedPayload, TipFeedbackPayload

log = structlog.get_logger(__name__)

NATS_URL = os.getenv("NATS_URL", "")
NATS_DURABLE_PREFIX = os.getenv("NATS_DURABLE_PREFIX", "feature-pipeline")
NATS_MAX_DELIVER = int(os.getenv("NATS_MAX_DELIVER", "5"))

# Exposed to /health
consumer_health: dict[str, dict] = {
    "signals": {"last_msg_ts": None, "processed": 0, "errors": 0},
    "feedback": {"last_msg_ts": None, "processed": 0, "errors": 0},
}

_nc = None  # nats.aio.Client
_subs: list = []  # active JetStream subscriptions


# ── Subject handlers ───────────────────────────────────────────────────────

def _sync_meta_path(state_dir: Path, user_id: str) -> Path:
    safe = "".join(c if c.isalnum() else "_" for c in user_id)
    return state_dir / f"{safe}_sync.json"


async def _handle(subject: str, payload: dict, state_dir: Path) -> None:
    if subject == "signals.task.synced":
        msg = TaskSyncedPayload.model_validate(payload)
        p = _sync_meta_path(state_dir, msg.userId)
        p.write_text(json.dumps({
            "last_sync_ts": msg.syncedAt,
            "task_count": msg.count,
        }))
        log.info("nats: task_synced", user_id=msg.userId, count=msg.count)
    elif subject == "signals.tip.feedback":
        msg = TipFeedbackPayload.model_validate(payload)
        log.info("nats: tip_feedback", user_id=msg.userId, tip_id=msg.tipId, action=msg.action, reward=msg.reward)
    else:
        log.debug("nats: unhandled subject", subject=subject)


# ── Consumer factory ───────────────────────────────────────────────────────

def _make_handler(key: str, state_dir: Path):
    """Return an async push-consumer callback that acks on success, naks on error."""
    async def handler(msg) -> None:
        consumer_health[key]["last_msg_ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        try:
            payload = json.loads(msg.data)
            await _handle(msg.subject, payload, state_dir)
            await msg.ack()
            consumer_health[key]["processed"] += 1
        except Exception as exc:
            consumer_health[key]["errors"] += 1
            log.warning("nats: processing error", key=key, subject=msg.subject, exc=str(exc))
            await msg.nak()
    return handler


# ── Lifecycle ──────────────────────────────────────────────────────────────

async def start(state_dir: Path) -> None:
    """Connect to NATS and register durable push consumers. No-op if NATS_URL is unset."""
    global _nc
    if not NATS_URL:
        log.info("nats: NATS_URL unset — JetStream consumers disabled")
        return
    try:
        import nats as nats_lib
        from nats.js.api import ConsumerConfig, AckPolicy

        _nc = await nats_lib.connect(
            NATS_URL,
            name="ml-serving",
            reconnect_time_wait=5,
            max_reconnect_attempts=-1,
        )
        js = _nc.jetstream()
        log.info("nats: connected", url=NATS_URL)
    except Exception as exc:
        log.warning("nats: connection failed — consumers disabled", exc=str(exc))
        _nc = None
        return

    config = ConsumerConfig(
        ack_policy=AckPolicy.EXPLICIT,
        max_deliver=NATS_MAX_DELIVER,
    )
    for key, subject in [("signals", "signals.>"), ("feedback", "feedback.>")]:
        durable = f"{NATS_DURABLE_PREFIX}-{key}"
        try:
            sub = await js.subscribe(
                subject,
                durable=durable,
                cb=_make_handler(key, state_dir),
                config=config,
            )
            _subs.append(sub)
            log.info("nats: subscribed", subject=subject, durable=durable)
        except Exception as exc:
            log.warning("nats: subscribe failed", key=key, exc=str(exc))


async def stop() -> None:
    """Drain subscriptions and close NATS connection."""
    global _nc
    for sub in _subs:
        try:
            await sub.unsubscribe()
        except Exception:
            pass
    _subs.clear()
    if _nc:
        try:
            await _nc.drain()
        except Exception:
            pass
        _nc = None
        log.info("nats: disconnected")


@@ -4,3 +4,6 @@ pydantic==2.10.4
numpy>=1.26.0 numpy>=1.26.0
httpx>=0.27.0 httpx>=0.27.0
anthropic>=0.40.0 anthropic>=0.40.0
nats-py>=2.9.0
structlog>=24.1.0
sentry-sdk>=2.0.0

50
ml/serving/schemas.py Normal file

@@ -0,0 +1,50 @@
"""
Pydantic models mirroring oo.events.v1 proto schemas.

Field names use camelCase to match the proto3 JSON mapping convention
and the TypeScript payload shapes published by services/api.

Keep in sync with packages/shared-types/events/oo/events/v1/.
"""
from __future__ import annotations

from typing import Literal, Optional

from pydantic import BaseModel


class TaskSyncedPayload(BaseModel):
    userId: str
    source: str
    count: int
    syncedAt: str


class TipServedPayload(BaseModel):
    userId: str
    tipId: str
    policy: str
    servedAt: str


class TipFeedbackPayload(BaseModel):
    userId: str
    tipId: str
    action: Literal['done', 'dismiss', 'snooze', 'helpful', 'not_helpful']
    reward: float
    dwellMs: Optional[int] = None
    createdAt: str


class TipRewardFailedPayload(BaseModel):
    userId: str
    tipId: str
    reward: float
    attempts: int
    error: str
    failedAt: str


class IntegrationTokenExpiredPayload(BaseModel):
    userId: str
    provider: str
    detectedAt: str


@@ -0,0 +1,169 @@
"""
Tests for schemas.py and nats_consumer._handle.
"""
import json
import pytest
import tempfile
from pathlib import Path
from pydantic import ValidationError
from unittest.mock import AsyncMock

from schemas import (
    TaskSyncedPayload,
    TipServedPayload,
    TipFeedbackPayload,
    TipRewardFailedPayload,
    IntegrationTokenExpiredPayload,
)
from nats_consumer import _handle, _sync_meta_path


# ── Schema validation ─────────────────────────────────────────────────────────

class TestTaskSyncedPayload:
    def test_valid(self):
        p = TaskSyncedPayload.model_validate(
            {"userId": "u1", "source": "todoist", "count": 5, "syncedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.userId == "u1"
        assert p.count == 5

    def test_missing_field_raises(self):
        with pytest.raises(ValidationError):
            TaskSyncedPayload.model_validate({"userId": "u1", "source": "todoist"})

    def test_wrong_type_raises(self):
        with pytest.raises(ValidationError):
            TaskSyncedPayload.model_validate(
                {"userId": "u1", "source": "todoist", "count": "not-an-int", "syncedAt": "2026-04-25T10:00:00Z"}
            )


class TestTipFeedbackPayload:
    def test_valid_without_dwell(self):
        p = TipFeedbackPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "action": "done", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
        )
        assert p.dwellMs is None

    def test_valid_with_dwell(self):
        p = TipFeedbackPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "action": "helpful", "reward": 0.5,
             "dwellMs": 3200, "createdAt": "2026-04-25T10:00:00Z"}
        )
        assert p.dwellMs == 3200

    def test_invalid_action_raises(self):
        with pytest.raises(ValidationError):
            TipFeedbackPayload.model_validate(
                {"userId": "u1", "tipId": "t1", "action": "like", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
            )

    def test_all_valid_actions(self):
        for action in ("done", "dismiss", "snooze", "helpful", "not_helpful"):
            p = TipFeedbackPayload.model_validate(
                {"userId": "u1", "tipId": "t1", "action": action, "reward": 0.0, "createdAt": "2026-04-25T10:00:00Z"}
            )
            assert p.action == action


class TestOtherPayloads:
    def test_tip_served(self):
        p = TipServedPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "policy": "egreedy-v2", "servedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.policy == "egreedy-v2"

    def test_tip_reward_failed(self):
        p = TipRewardFailedPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "reward": 1.0, "attempts": 3,
             "error": "timeout", "failedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.attempts == 3

    def test_integration_token_expired(self):
        p = IntegrationTokenExpiredPayload.model_validate(
            {"userId": "u1", "provider": "todoist", "detectedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.provider == "todoist"


# ── _handle behaviour ─────────────────────────────────────────────────────────

TASK_SYNCED = {
    "userId": "user-abc",
    "source": "todoist",
    "count": 7,
    "syncedAt": "2026-04-25T10:00:00Z",
}

TIP_FEEDBACK = {
    "userId": "user-abc",
    "tipId": "tip-xyz",
    "action": "done",
    "reward": 1.0,
    "dwellMs": 4200,
    "createdAt": "2026-04-25T10:00:00Z",
}


class TestHandle:
    @pytest.mark.asyncio
    async def test_task_synced_writes_meta_file(self):
        with tempfile.TemporaryDirectory() as tmp:
            state_dir = Path(tmp)
            await _handle("signals.task.synced", TASK_SYNCED, state_dir)
            meta_path = _sync_meta_path(state_dir, "user-abc")
            assert meta_path.exists()
            data = json.loads(meta_path.read_text())
            assert data["task_count"] == 7
            assert data["last_sync_ts"] == "2026-04-25T10:00:00Z"

    @pytest.mark.asyncio
    async def test_task_synced_bad_payload_raises(self):
        with tempfile.TemporaryDirectory() as tmp:
            with pytest.raises(ValidationError):
                await _handle("signals.task.synced", {"userId": "u1"}, Path(tmp))

    @pytest.mark.asyncio
    async def test_tip_feedback_valid_does_not_raise(self):
        with tempfile.TemporaryDirectory() as tmp:
            # should log and return cleanly
            await _handle("signals.tip.feedback", TIP_FEEDBACK, Path(tmp))

    @pytest.mark.asyncio
    async def test_tip_feedback_bad_action_raises(self):
        bad = {**TIP_FEEDBACK, "action": "unknown"}
        with tempfile.TemporaryDirectory() as tmp:
            with pytest.raises(ValidationError):
                await _handle("signals.tip.feedback", bad, Path(tmp))

    @pytest.mark.asyncio
    async def test_unhandled_subject_is_ignored(self):
        with tempfile.TemporaryDirectory() as tmp:
            # should not raise for unknown subjects
            await _handle("signals.something.new", {"any": "data"}, Path(tmp))

    @pytest.mark.asyncio
    async def test_make_handler_acks_on_success(self):
        from nats_consumer import _make_handler
        with tempfile.TemporaryDirectory() as tmp:
            handler = _make_handler("signals", Path(tmp))
            msg = AsyncMock()
            msg.subject = "signals.task.synced"
            msg.data = json.dumps(TASK_SYNCED).encode()
            await handler(msg)
            msg.ack.assert_awaited_once()
            msg.nak.assert_not_awaited()

    @pytest.mark.asyncio
    async def test_make_handler_naks_on_validation_error(self):
        from nats_consumer import _make_handler
        with tempfile.TemporaryDirectory() as tmp:
            handler = _make_handler("signals", Path(tmp))
            msg = AsyncMock()
            msg.subject = "signals.task.synced"
            msg.data = json.dumps({"userId": "u1"}).encode()  # missing fields
            await handler(msg)
            msg.nak.assert_awaited_once()
            msg.ack.assert_not_awaited()


@@ -0,0 +1,63 @@
# @oo/shared-types
Canonical contracts for all inter-module communication. Two surfaces:
| Surface | Format | Location |
|---------|--------|----------|
| HTTP (sync) | OpenAPI / TypeScript interfaces | `src/http/` |
| Events (async) | Protocol Buffers + TS interfaces | `src/events/`, `events/` |
## HTTP types
Hand-written TypeScript interfaces generated from OpenAPI specs. Imported by
`services/api` and `apps/web`; `ml/serving` keeps hand-written Python mirrors of the same shapes.
| File | Types |
|------|-------|
| `src/http/tip.ts` | `TipCandidate`, `RecommendResponse`, `TipFeedback` |
| `src/http/auth.ts` | `SessionUser` |
| `src/http/integrations.ts` | `IntegrationsResponse`, `Integration` |
| `src/http/user.ts` | `UserProfile` |
| `src/http/signal.ts` | `Signal`, `SignalSource` |
## Event types
Protobuf schemas live in `events/oo/events/v1/`. TypeScript interfaces in
`src/events/index.ts` mirror the proto envelope and payload types.
| Proto file | Messages |
|------------|----------|
| `envelope.proto` | `Envelope` (wraps every event) |
| `signals.proto` | `TaskSyncedPayload`, `TipServedPayload`, `TipFeedbackPayload`, `TipRewardFailedPayload` |
| `integration.proto` | `IntegrationTokenExpiredPayload` |
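
The events use the proto3 JSON mapping on the wire, so snake_case proto field names such as `user_id` appear as lowerCamelCase keys (`userId`) in the TypeScript and Python mirrors. A minimal sketch of that naming rule for plain ASCII field names; `to_json_name` is a hypothetical helper and not part of this package:

```python
def to_json_name(proto_field: str) -> str:
    """Convert a snake_case proto field name to its lowerCamelCase JSON name."""
    head, *rest = proto_field.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)
```

For example, `to_json_name("dwell_ms")` gives the `dwellMs` key used by `TipFeedbackPayload`.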
**Schema evolution rules (ADR-0005):**
- Additive changes only within a version (new fields, new message types).
- Removed fields must be marked `reserved` — never reuse a field number.
- Breaking changes require a new package version (`oo.events.v2`) and a `schemaVersion` bump in the envelope.
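
As a toy illustration of the "never reuse a field number" rule, the check below compares two name-to-number maps for one message. It is only a sketch in Python: `reused_field_numbers` is a hypothetical helper, the maps are hand-written rather than parsed from `.proto` files, and `buf breaking` remains the real gate.

```python
def reused_field_numbers(old: dict[str, int], new: dict[str, int]) -> set[int]:
    """Field numbers that existed before but now belong to a different field name."""
    old_by_number = {number: name for name, number in old.items()}
    return {
        number
        for name, number in new.items()
        if number in old_by_number and old_by_number[number] != name
    }
```

Adding a field with a fresh number yields an empty set, while giving an existing number to a new name is reported.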
## Schema registry / CI gate
`buf` enforces lint and breaking-change detection on every PR that touches `events/`:
```bash
# Lint
buf lint events/
# Breaking-change check against main
buf breaking events/ --against '.git#branch=main,subdir=packages/shared-types/events'
```
Local shortcut: `./scripts/buf-check.sh`
CI: `.gitea/workflows/buf-check.yaml` (requires a Gitea Actions runner).
Install buf: `curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf`
## Contract
`/health` — not applicable (library package, no process).
**Extraction criteria** — always a shared library. Extract to a separate registry
service only when schema governance requires independent versioning and deployment
(e.g. external consumers, SLA divergence from the monorepo).


@@ -0,0 +1,7 @@
version: v1
lint:
  use:
    - STANDARD
breaking:
  use:
    - FILE


@@ -0,0 +1,25 @@
syntax = "proto3";
package oo.events.v1;
import "oo/events/v1/signals.proto";
import "oo/events/v1/integration.proto";
// Envelope wraps every event on the bus and on NATS JetStream.
// Wire format: proto3 JSON (camelCase field names).
// schema_version = "v1" — bump to "v2" only for breaking payload changes.
message Envelope {
string event_id = 1; // UUID assigned by bus on publish
string occurred_at = 2; // ISO 8601
string schema_version = 3; // "v1"
string producer = 4; // e.g. "services/api"
string subject = 5; // NATS-style subject: domain.entity.verb
uint64 seq = 6; // monotonic sequence from the bus ring
oneof payload {
TaskSyncedPayload task_synced = 10;
TipServedPayload tip_served = 11;
TipFeedbackPayload tip_feedback = 12;
TipRewardFailedPayload tip_reward_failed = 13;
IntegrationTokenExpiredPayload integration_token_expired = 14;
}
}


@@ -0,0 +1,9 @@
syntax = "proto3";
package oo.events.v1;
// subject: signals.integration.token_expired
message IntegrationTokenExpiredPayload {
string user_id = 1;
string provider = 2;
string detected_at = 3; // ISO 8601
}


@@ -0,0 +1,39 @@
syntax = "proto3";
package oo.events.v1;
// subject: signals.task.synced
message TaskSyncedPayload {
string user_id = 1;
string source = 2; // e.g. "todoist"
int32 count = 3;
string synced_at = 4; // ISO 8601
}
// subject: signals.tip.served
message TipServedPayload {
string user_id = 1;
string tip_id = 2;
string policy = 3;
string served_at = 4; // ISO 8601
}
// subject: signals.tip.feedback
// action: done | dismiss | snooze | helpful | not_helpful
message TipFeedbackPayload {
string user_id = 1;
string tip_id = 2;
string action = 3;
double reward = 4;
optional int64 dwell_ms = 5; // null when no dwell was recorded
string created_at = 6; // ISO 8601
}
// subject: signals.tip.reward_failed
message TipRewardFailedPayload {
string user_id = 1;
string tip_id = 2;
double reward = 3;
int32 attempts = 4;
string error = 5;
string failed_at = 6; // ISO 8601
}


@@ -15,7 +15,9 @@
"test": "vitest run", "test": "vitest run",
"test:watch": "vitest", "test:watch": "vitest",
"type-check": "tsc --noEmit", "type-check": "tsc --noEmit",
"clean": "rm -rf dist" "clean": "rm -rf dist",
"buf:lint": "buf lint events",
"buf:breaking": "buf breaking events --against '.git#branch=main,subdir=packages/shared-types/events'"
}, },
"devDependencies": { "devDependencies": {
"@vitest/coverage-v8": "^4.1.4", "@vitest/coverage-v8": "^4.1.4",


@@ -1,6 +1,6 @@
/** /**
* NormalizedEvent — the durable envelope for all events flowing through * NormalizedEvent — the durable envelope for all events flowing through
* the system. Today: in-process EventEmitter. Tomorrow: NATS JetStream. * the system. Mirrors oo.events.v1.Envelope in packages/shared-types/events/.
* *
* Subject taxonomy: * Subject taxonomy:
* signals.task.synced — Todoist (or other source) task list refreshed * signals.task.synced — Todoist (or other source) task list refreshed
@@ -10,10 +10,16 @@
* signals.integration.token_expired — OAuth token needs reconnect * signals.integration.token_expired — OAuth token needs reconnect
*/ */
export interface NormalizedEvent<T = unknown> { export interface NormalizedEvent<T = unknown> {
/** UUID assigned by bus on publish */
eventId: string;
/** NATS-style subject: domain.entity.verb */ /** NATS-style subject: domain.entity.verb */
subject: string; subject: string;
/** ISO 8601 timestamp */ /** ISO 8601 timestamp */
ts: string; occurredAt: string;
/** "v1" — bump for breaking payload changes; see packages/shared-types/events/ */
schemaVersion: 'v1';
/** e.g. "services/api" */
producer: string;
/** Monotonically increasing sequence number (in-process ring; JetStream seq in prod) */ /** Monotonically increasing sequence number (in-process ring; JetStream seq in prod) */
seq: number; seq: number;
payload: T; payload: T;


@@ -4,5 +4,6 @@
"outDir": "dist", "outDir": "dist",
"rootDir": "src" "rootDir": "src"
}, },
"include": ["src"] "include": ["src"],
"exclude": ["src/__tests__", "**/*.test.ts"]
} }

877
pnpm-lock.yaml generated

File diff suppressed because it is too large

24
scripts/buf-check.sh Executable file

@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Run buf lint and breaking-change detection locally.
# Usage: ./scripts/buf-check.sh [against-branch]
# Default against-branch: main
set -euo pipefail
AGAINST="${1:-main}"
ROOT="$(cd "$(dirname "$0")/.." && pwd)"
EVENTS="$ROOT/packages/shared-types/events"
if ! command -v buf &>/dev/null; then
echo "buf not found. Install: https://buf.build/docs/installation"
echo " curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf"
exit 1
fi
echo "==> buf lint"
buf lint "$EVENTS"
echo "==> buf breaking against $AGAINST"
buf breaking "$EVENTS" \
--against ".git#branch=${AGAINST},subdir=packages/shared-types/events"
echo "All checks passed."

91
services/api/README.md Normal file

@@ -0,0 +1,91 @@
# services/api
Express BFF that serves all client-facing routes, manages sessions, runs background signal sync, and proxies admin calls to `ml/serving`.
## Contract
```
GET /health { ok: true }
POST /api/auth/login → redirect to Google OAuth
GET /api/auth/callback OAuth return URL
POST /api/auth/logout
GET /api/auth/session → { user? }
POST /api/auth/token { token } → set sid cookie (ADMIN_TOKEN auth)
GET /api/integrations list connected integrations
POST /api/integrations/todoist/connect start Todoist OAuth
GET /api/integrations/todoist/callback
DELETE /api/integrations/:provider disconnect
POST /api/recommend → { tip }
POST /api/tip/:id/feedback { action } → { ok }
GET /api/user/profile
DELETE /api/user account deletion
POST /api/push/subscribe
DELETE /api/push/subscribe
GET /api/admin/stats DAU/WAU, feedback breakdown
GET /api/admin/users
GET /api/admin/events recent event stream (ring buffer)
GET /api/admin/sim/runs offline sim run list
POST /api/admin/sim/run launch offline sim
GET /api/admin/sim/runs/:id/output tail sim stdout
...
GET /api/ml/* admin-only proxy to ml/serving
```
## Middleware stack (request order)
1. `cors` — origin limited to `WEB_BASE_URL`
2. `tracingMiddleware` — reads or generates W3C `traceparent`; sets `req.traceId` + `req.traceparent`
3. `pinoHttp` — structured JSON request/response logs with `traceId` field; `/health` suppressed
4. `express.json()` / `cookieParser`
5. `sessionMiddleware` — validates `sid` cookie, attaches `req.userId`
## Observability
Logs are structured JSON via **pino**. Every line includes `traceId` (extracted from the incoming W3C `traceparent` header, or generated fresh). The same `traceparent` is forwarded on all outbound HTTP calls to `ml/serving` so traces correlate end-to-end.
Sentry error capture is active when `SENTRY_DSN` is set.
## Background tasks
- **Todoist sync scheduler** — runs every `TODOIST_SYNC_INTERVAL_MS` (default 15 min); starts 10 s after boot to avoid startup surge.
- **Retention purge** — deletes `tipScores` and `tipFeedback` rows older than 30 days; runs on boot and daily.
- **Profile TTL invalidation** — listens to `signals.task.synced` and `signals.tip.feedback` on the in-process Bus; invalidates cached user-level profile features so the next `/recommend` gets fresh values.
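
The retention purge in the list above boils down to a cutoff timestamp and a delete. A minimal sketch, in Python with `sqlite3` for illustration only (the service itself is TypeScript); the table names `tip_scores` and `tip_feedback` and the ISO 8601 text column `created_at` are assumptions rather than confirmed schema:

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION_DAYS = 30
_PURGEABLE = {"tip_scores", "tip_feedback"}  # assumed table names


def purge_old_rows(conn: sqlite3.Connection, table: str, now: Optional[datetime] = None) -> int:
    """Delete rows older than RETENTION_DAYS and return how many were removed."""
    if table not in _PURGEABLE:  # allow-list, since table names cannot be bound as parameters
        raise ValueError(f"unexpected table: {table}")
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=RETENTION_DAYS)).isoformat()
    cur = conn.execute(f"DELETE FROM {table} WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Comparing ISO 8601 strings works here only if every row uses the same format and timezone offset.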
## Config
| Env var | Default | Description |
|---------|---------|-------------|
| `PORT` | `3001` | Listen port |
| `NODE_ENV` | `development` | Environment label |
| `DATABASE_PATH` | `./data/oo.db` | SQLite file |
| `SESSION_SECRET` | required | Cookie signing secret |
| `GOOGLE_CLIENT_ID/SECRET` | required | OAuth |
| `TODOIST_CLIENT_ID/SECRET` | required | OAuth |
| `API_BASE_URL` | `http://localhost:3001` | Self-referential redirect URI |
| `WEB_BASE_URL` | `http://localhost:3000` | CORS + post-login redirect |
| `ML_SERVING_URL` | `http://localhost:8000` | ml/serving base URL |
| `NATS_URL` | `` | NATS broker; empty = in-process bus only |
| `TODOIST_SYNC_INTERVAL_MS` | `900000` | Background sync cadence |
| `TIP_PROMPT_VERSION` | `` | Prompt variant(s) for `/generate` |
| `LOG_LEVEL` | `info` | pino log level |
| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
| `VAPID_*` | | Web push keys |
| `ADMIN_TOKEN` | `` | Static token for service/Playwright admin auth; empty = disabled |
## Health story
`GET /health` returns `{ ok: true }`. No dependency checks — upstream deps (`ml/serving`, NATS) have their own health endpoints checked separately.
## Extraction criteria
Extract to its own host when:
- Auth session management needs a dedicated Redis/PG session store, **or**
- Background sync load (Todoist, future connectors) displaces API serving on the shared host, **or**
- Team boundary emerges between auth/BFF and recommender orchestration.


@@ -16,6 +16,7 @@
}, },
"dependencies": { "dependencies": {
"@oo/shared-types": "workspace:*", "@oo/shared-types": "workspace:*",
"@sentry/node": "^10.50.0",
"better-sqlite3": "^11.8.1", "better-sqlite3": "^11.8.1",
"cookie-parser": "^1.4.7", "cookie-parser": "^1.4.7",
"cors": "^2.8.5", "cors": "^2.8.5",
@@ -27,6 +28,8 @@
"nats": "^2.29.3", "nats": "^2.29.3",
"node-fetch": "^3.3.2", "node-fetch": "^3.3.2",
"openid-client": "^6.3.4", "openid-client": "^6.3.4",
"pino": "^10.3.1",
"pino-http": "^11.0.0",
"web-push": "^3.6.7", "web-push": "^3.6.7",
"zod": "^3.24.1" "zod": "^3.24.1"
}, },


@@ -34,6 +34,17 @@ export const config = {
ML_SERVING_URL: optional('ML_SERVING_URL', 'http://localhost:8000'), ML_SERVING_URL: optional('ML_SERVING_URL', 'http://localhost:8000'),
LITELLM_URL: optional('LITELLM_URL', 'http://localhost:4000'), LITELLM_URL: optional('LITELLM_URL', 'http://localhost:4000'),
MLFLOW_URL: optional('MLFLOW_URL', 'http://localhost:5000'),
AIRFLOW_URL: optional('AIRFLOW_URL', 'http://localhost:8080'),
AIRFLOW_API_USER: optional('AIRFLOW_API_USER', 'admin'),
AIRFLOW_API_PASSWORD: optional('AIRFLOW_API_PASSWORD', 'admin'),
/** Shared secret for internal Airflow→API callbacks. */
INTERNAL_API_TOKEN: optional('INTERNAL_API_TOKEN', ''),
/** Static token for automated/service access to the admin panel (e.g. Playwright tests). */
ADMIN_TOKEN: optional('ADMIN_TOKEN', ''),
VAPID_PUBLIC_KEY: optional('VAPID_PUBLIC_KEY', ''), VAPID_PUBLIC_KEY: optional('VAPID_PUBLIC_KEY', ''),
VAPID_PRIVATE_KEY: optional('VAPID_PRIVATE_KEY', ''), VAPID_PRIVATE_KEY: optional('VAPID_PRIVATE_KEY', ''),
VAPID_SUBJECT: optional('VAPID_SUBJECT', 'mailto:admin@localhost'), VAPID_SUBJECT: optional('VAPID_SUBJECT', 'mailto:admin@localhost'),


@@ -156,6 +156,10 @@ export function runMigrations() {
`ALTER TABLE tip_scores ADD COLUMN prompt_version TEXT`, `ALTER TABLE tip_scores ADD COLUMN prompt_version TEXT`,
`ALTER TABLE tip_scores ADD COLUMN llm_model TEXT`, `ALTER TABLE tip_scores ADD COLUMN llm_model TEXT`,
`ALTER TABLE tip_scores ADD COLUMN tip_kind TEXT`, `ALTER TABLE tip_scores ADD COLUMN tip_kind TEXT`,
`ALTER TABLE sim_runs ADD COLUMN airflow_dag_run_id TEXT`,
`ALTER TABLE sim_runs ADD COLUMN mlflow_run_id TEXT`,
`ALTER TABLE sim_runs ADD COLUMN judge_mode TEXT NOT NULL DEFAULT 'rule'`,
`ALTER TABLE sim_runs ADD COLUMN n_policies INTEGER NOT NULL DEFAULT 2`,
]) { ]) {
try { sqlite.exec(stmt); } catch { /* column already exists */ } try { sqlite.exec(stmt); } catch { /* column already exists */ }
} }
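
The loop above swallows every error from `sqlite.exec`, so a typo in a statement or a missing table is silently ignored along with the expected "column already exists" case. A minimal sketch of a narrower variant follows. It assumes the driver surfaces SQLite's `duplicate column name: <col>` message, and `Exec` and `addColumnsIfMissing` are hypothetical names, not part of this change.

```typescript
// Hypothetical stand-in for sqlite.exec from the hunk above.
type Exec = (sql: string) => void;

/**
 * Runs each ALTER TABLE statement. Statements whose column already exists are
 * skipped and returned; any other failure is rethrown instead of being hidden.
 */
function addColumnsIfMissing(exec: Exec, statements: string[]): string[] {
  const skipped: string[] = [];
  for (const stmt of statements) {
    try {
      exec(stmt);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      // Assumption: SQLite reports "duplicate column name: <col>" for an existing column.
      if (!message.includes('duplicate column name')) throw err;
      skipped.push(stmt);
    }
  }
  return skipped;
}
```

Called with `(stmt) => sqlite.exec(stmt)` and the statement list from `runMigrations`, this keeps the re-runnable behaviour while letting unexpected migration errors surface at startup.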

@@ -112,9 +112,13 @@ export const simRuns = sqliteTable('sim_runs', {
tasksPerRound: integer('tasks_per_round').notNull().default(8), tasksPerRound: integer('tasks_per_round').notNull().default(8),
useLlm: integer('use_llm', { mode: 'boolean' }).notNull().default(false), useLlm: integer('use_llm', { mode: 'boolean' }).notNull().default(false),
status: text('status').notNull().default('pending'), // 'pending'|'running'|'done'|'failed' status: text('status').notNull().default('pending'), // 'pending'|'running'|'done'|'failed'
judgeMode: text('judge_mode').notNull().default('rule'),
nPolicies: integer('n_policies').notNull().default(2),
summaryJson: text('summary_json'), // JSON: { [policy]: PolicySummary } summaryJson: text('summary_json'), // JSON: { [policy]: PolicySummary }
winner: text('winner'), winner: text('winner'),
personaBreakdownJson: text('persona_breakdown_json'), // JSON: { [persona]: { [policy]: {reward,n} } } personaBreakdownJson: text('persona_breakdown_json'), // JSON: { [persona]: { [policy]: {reward,n} } }
airflowDagRunId: text('airflow_dag_run_id'),
mlflowRunId: text('mlflow_run_id'),
createdAt: text('created_at').notNull(), createdAt: text('created_at').notNull(),
finishedAt: text('finished_at'), finishedAt: text('finished_at'),
}); });

@@ -56,7 +56,7 @@ describe('EventBus — delivery', () => {
it('does not throw when publishing with no subscribers', () => { it('does not throw when publishing with no subscribers', () => {
const b = makeBus(); const b = makeBus();
expect(() => expect(() =>
b.publish('signals.task.synced', { userId: 'u', count: 3, syncedAt: '' }), b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 3, syncedAt: '' }),
).not.toThrow(); ).not.toThrow();
}); });
@@ -101,7 +101,7 @@ describe('EventBus — ring buffer / tail()', () => {
it('tail() filters by subject prefix', () => { it('tail() filters by subject prefix', () => {
const b = makeBus(); const b = makeBus();
b.publish('signals.tip.served', { userId: 'u', tipId: 't', policy: 'p', servedAt: '' }); b.publish('signals.tip.served', { userId: 'u', tipId: 't', policy: 'p', servedAt: '' });
b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' }); b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
const tipEvents = b.tail({ subject: 'signals.tip' }); const tipEvents = b.tail({ subject: 'signals.tip' });
expect(tipEvents.every((e) => e.subject.startsWith('signals.tip'))).toBe(true); expect(tipEvents.every((e) => e.subject.startsWith('signals.tip'))).toBe(true);
@@ -178,7 +178,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
const hook = vi.fn(); const hook = vi.fn();
b.onPublish(hook); b.onPublish(hook);
const payload = { userId: 'u', count: 2, syncedAt: 'now' }; const payload = { userId: 'u', source: 'todoist', count: 2, syncedAt: 'now' };
b.publish('signals.task.synced', payload); b.publish('signals.task.synced', payload);
expect(hook).toHaveBeenCalledOnce(); expect(hook).toHaveBeenCalledOnce();
@@ -191,7 +191,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
b.onPublish(() => calls.push('a')); b.onPublish(() => calls.push('a'));
b.onPublish(() => calls.push('b')); b.onPublish(() => calls.push('b'));
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }); b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
expect(calls).toEqual(['a', 'b']); expect(calls).toEqual(['a', 'b']);
}); });
@@ -202,7 +202,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
b.onPublish(hook); b.onPublish(hook);
b.subscribe('signals.task.synced', sub); b.subscribe('signals.task.synced', sub);
b.publish('signals.task.synced', { userId: 'u', count: 1, syncedAt: '' }); b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 1, syncedAt: '' });
expect(hook).toHaveBeenCalledOnce(); expect(hook).toHaveBeenCalledOnce();
expect(sub).toHaveBeenCalledOnce(); expect(sub).toHaveBeenCalledOnce();
}); });
@@ -215,7 +215,7 @@ describe('EventBus — onPublish hook (NATS bridge contract)', () => {
throw new Error('boom'); throw new Error('boom');
}); });
expect(() => expect(() =>
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }), b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
).toThrow('boom'); ).toThrow('boom');
}); });
}); });

@@ -106,7 +106,7 @@ describe('connectNats — bridge bus → JetStream', () => {
await connectNats('nats://test:4222'); await connectNats('nats://test:4222');
const payload = { userId: 'u1', count: 7, syncedAt: '2026-01-01T00:00:00Z' }; const payload = { userId: 'u1', source: 'todoist', count: 7, syncedAt: '2026-01-01T00:00:00Z' };
bus.publish('signals.task.synced', payload); bus.publish('signals.task.synced', payload);
// Allow the queued microtask in the hook to flush. // Allow the queued microtask in the hook to flush.
@@ -121,16 +121,17 @@ describe('connectNats — bridge bus → JetStream', () => {
it('swallows JetStream publish errors so the in-process bus keeps working', async () => { it('swallows JetStream publish errors so the in-process bus keeps working', async () => {
const { connectNats } = await import('../nats.js'); const { connectNats } = await import('../nats.js');
const { logger } = await import('../../logger.js');
const { bus } = await import('../bus.js'); const { bus } = await import('../bus.js');
await connectNats('nats://test:4222'); await connectNats('nats://test:4222');
// Force the next js.publish to reject. // Force the next js.publish to reject.
lastJsPublish.mockRejectedValueOnce(new Error('jetstream down')); lastJsPublish.mockRejectedValueOnce(new Error('jetstream down'));
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {}); const errSpy = vi.spyOn(logger, 'error');
expect(() => expect(() =>
bus.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }), bus.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' }),
).not.toThrow(); ).not.toThrow();
// Wait a tick for the rejected promise's catch to run. // Wait a tick for the rejected promise's catch to run.
@@ -142,12 +143,16 @@ describe('connectNats — bridge bus → JetStream', () => {
describe('connectNats — failure mode', () => { describe('connectNats — failure mode', () => {
it('logs a warning and stays silent when connect rejects', async () => { it('logs a warning and stays silent when connect rejects', async () => {
const { connectNats } = await import('../nats.js'); const { connectNats } = await import('../nats.js');
const { logger } = await import('../../logger.js');
lastConnect.mockRejectedValueOnce(new Error('ECONNREFUSED')); lastConnect.mockRejectedValueOnce(new Error('ECONNREFUSED'));
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {}); const warnSpy = vi.spyOn(logger, 'warn');
await expect(connectNats('nats://nope:4222')).resolves.toBeUndefined(); await expect(connectNats('nats://nope:4222')).resolves.toBeUndefined();
expect(warnSpy).toHaveBeenCalledWith(expect.stringContaining('connection failed')); expect(warnSpy).toHaveBeenCalledWith(
expect.objectContaining({ err: expect.anything() }),
expect.stringContaining('connection failed'),
);
}); });
}); });
@@ -156,7 +161,7 @@ describe('Bus.onPublish contract — used by NATS bridge', () => {
const b = new Bus(); const b = new Bus();
const hook = vi.fn(); const hook = vi.fn();
b.onPublish(hook); b.onPublish(hook);
b.publish('signals.task.synced', { userId: 'u', count: 0, syncedAt: '' }); b.publish('signals.task.synced', { userId: 'u', source: 'todoist', count: 0, syncedAt: '' });
expect(hook).toHaveBeenCalledOnce(); expect(hook).toHaveBeenCalledOnce();
}); });
}); });

@@ -45,6 +45,7 @@ export type RewardDeliveryFailedEvent = {
export type TaskSyncedEvent = { export type TaskSyncedEvent = {
userId: string; userId: string;
source: string; // e.g. 'todoist'
count: number; count: number;
syncedAt: string; syncedAt: string;
}; };

@@ -12,6 +12,7 @@
import type { NatsConnection, JetStreamClient, StreamConfig } from 'nats'; import type { NatsConnection, JetStreamClient, StreamConfig } from 'nats';
import { bus } from './bus.js'; import { bus } from './bus.js';
import { logger } from '../logger.js';
let nc: NatsConnection | null = null; let nc: NatsConnection | null = null;
let js: JetStreamClient | null = null; let js: JetStreamClient | null = null;
@@ -67,13 +68,13 @@ export async function connectNats(natsUrl: string): Promise<void> {
if (!js) return; if (!js) return;
const data = new TextEncoder().encode(JSON.stringify(payload)); const data = new TextEncoder().encode(JSON.stringify(payload));
js.publish(subject, data).catch((err: Error) => js.publish(subject, data).catch((err: Error) =>
console.error(`[nats] publish failed for ${subject}: ${err.message}`), logger.error({ err, subject }, 'nats publish failed'),
); );
}); });
console.log(`[nats] connected to ${natsUrl}, streams: ${STREAMS.map((s) => s.name).join(', ')}`); logger.info({ url: natsUrl, streams: STREAMS.map((s) => s.name) }, 'nats connected');
} catch (err: any) { } catch (err: any) {
console.warn(`[nats] connection failed — running without JetStream: ${err.message}`); logger.warn({ err }, 'nats connection failed — running without JetStream');
} }
} }

@@ -1,7 +1,10 @@
import 'dotenv/config'; import 'dotenv/config';
import { logger } from './logger.js';
import express from 'express'; import express from 'express';
import { pinoHttp } from 'pino-http';
import cookieParser from 'cookie-parser'; import cookieParser from 'cookie-parser';
import cors from 'cors'; import cors from 'cors';
import { tracingMiddleware } from './middleware/tracing.js';
import { config } from './config.js'; import { config } from './config.js';
import { db, runMigrations } from './db/index.js'; import { db, runMigrations } from './db/index.js';
import { tipScores, tipFeedback } from './db/schema.js'; import { tipScores, tipFeedback } from './db/schema.js';
@@ -12,7 +15,7 @@ import { integrationsRouter } from './routes/integrations.js';
import { recommenderRouter } from './routes/recommender.js'; import { recommenderRouter } from './routes/recommender.js';
import { userRouter } from './routes/user.js'; import { userRouter } from './routes/user.js';
import { pushRouter } from './routes/push.js'; import { pushRouter } from './routes/push.js';
import { adminRouter } from './routes/admin.js'; import { adminRouter, adminInternalRouter } from './routes/admin.js';
import { mkdir } from 'fs/promises'; import { mkdir } from 'fs/promises';
import { dirname } from 'path'; import { dirname } from 'path';
import { requireAuth } from './middleware/session.js'; import { requireAuth } from './middleware/session.js';
@@ -26,13 +29,11 @@ import { registerProfileSubscriptions } from './profile/subscriber.js';
await mkdir(dirname(config.DATABASE_PATH), { recursive: true }); await mkdir(dirname(config.DATABASE_PATH), { recursive: true });
runMigrations(); runMigrations();
// Keep the API alive on stray async faults (e.g. a single bad admin route)
// rather than dropping the whole process.
process.on('unhandledRejection', (reason) => { process.on('unhandledRejection', (reason) => {
console.error('[api] unhandledRejection', reason); logger.error({ err: reason }, 'unhandledRejection');
}); });
process.on('uncaughtException', (err) => { process.on('uncaughtException', (err) => {
console.error('[api] uncaughtException', err); logger.fatal({ err }, 'uncaughtException');
}); });
const app = express(); const app = express();
@@ -43,6 +44,15 @@ app.use(
credentials: true, credentials: true,
}), }),
); );
app.use(tracingMiddleware);
app.use(
pinoHttp({
logger,
genReqId: (req) => req.traceId,
customProps: (req) => ({ traceId: req.traceId }),
autoLogging: { ignore: (req) => req.url === '/health' },
}),
);
app.use(express.json()); app.use(express.json());
app.use(cookieParser()); app.use(cookieParser());
app.use(sessionMiddleware); app.use(sessionMiddleware);
@@ -55,17 +65,15 @@ app.use('/api', recommenderRouter);
app.use('/api/user', userRouter); app.use('/api/user', userRouter);
app.use('/api/push', pushRouter); app.use('/api/push', pushRouter);
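// Mounted before adminRouter so its session auth (router.use) cannot intercept internal callbacks.
app.use('/api/admin', adminInternalRouter);
app.use('/api/admin', adminRouter); app.use('/api/admin', adminRouter);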
// Proxy ml/serving endpoints through the API (admin-only).
// Allows admin UI to call /api/ml/stats/:userId, /api/ml/features/:userId
// without needing direct access to the ml/serving port.
app.use('/api/ml', requireAuth as any, requireAdmin as any, async (req: Request, res: Response) => { app.use('/api/ml', requireAuth as any, requireAdmin as any, async (req: Request, res: Response) => {
const mlUrl = config.ML_SERVING_URL; const mlUrl = config.ML_SERVING_URL;
const target = `${mlUrl}${req.path}`; const target = `${mlUrl}${req.path}`;
try { try {
const upstream = await fetch(target, { const upstream = await fetch(target, {
method: req.method, method: req.method,
headers: { 'Content-Type': 'application/json' }, headers: { 'Content-Type': 'application/json', traceparent: req.traceparent },
body: req.method !== 'GET' ? JSON.stringify(req.body) : undefined, body: req.method !== 'GET' ? JSON.stringify(req.body) : undefined,
signal: AbortSignal.timeout(5000), signal: AbortSignal.timeout(5000),
}); });
@@ -82,7 +90,7 @@ async function purgeExpiredData() {
await db.delete(tipScores).where(lt(tipScores.servedAt, cutoff)); await db.delete(tipScores).where(lt(tipScores.servedAt, cutoff));
await db.delete(tipFeedback).where(lt(tipFeedback.createdAt, cutoff)); await db.delete(tipFeedback).where(lt(tipFeedback.createdAt, cutoff));
} catch (err: any) { } catch (err: any) {
console.error(`[purge] retention cleanup failed: ${err.message}`); logger.error({ err }, 'retention cleanup failed');
} }
} }
@@ -90,7 +98,7 @@ purgeExpiredData();
setInterval(purgeExpiredData, 24 * 60 * 60 * 1000); setInterval(purgeExpiredData, 24 * 60 * 60 * 1000);
app.listen(config.PORT, () => { app.listen(config.PORT, () => {
console.log(`oO API listening on http://localhost:${config.PORT}`); logger.info({ port: config.PORT }, 'oO API listening');
}); });
if (config.NATS_URL) { if (config.NATS_URL) {

@@ -0,0 +1,12 @@
import pino from 'pino';
import * as Sentry from '@sentry/node';
if (process.env['SENTRY_DSN']) {
Sentry.init({
dsn: process.env['SENTRY_DSN'],
environment: process.env['NODE_ENV'] ?? 'development',
});
}
export const logger = pino({ level: process.env['LOG_LEVEL'] ?? 'info' });
export { Sentry };

@@ -0,0 +1,26 @@
import { randomBytes } from 'crypto';
import type { Request, Response, NextFunction } from 'express';
declare global {
namespace Express {
interface Request {
traceId: string;
traceparent: string;
}
}
}
export function tracingMiddleware(req: Request, _res: Response, next: NextFunction): void {
const incoming = req.headers['traceparent'] as string | undefined;
let traceId: string;
if (incoming) {
const parts = incoming.split('-');
traceId = parts.length === 4 && parts[1]?.length === 32 ? parts[1] : randomBytes(16).toString('hex');
} else {
traceId = randomBytes(16).toString('hex');
}
const parentId = randomBytes(8).toString('hex');
req.traceId = traceId;
req.traceparent = `00-${traceId}-${parentId}-01`;
next();
}
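
The middleware above accepts any incoming `traceparent` whose second field is 32 characters long. A stricter parse, following the W3C Trace Context field layout, could look like the sketch below. It is an illustration under assumptions: only version `00` is accepted, and `resolveTraceId` and `nextTraceparent` are hypothetical helper names rather than part of this change.

```typescript
import { randomBytes } from 'node:crypto';

// Version "00", 32 lowercase-hex trace-id, 16 lowercase-hex parent-id, 2 hex flags.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/;
const ALL_ZERO_RE = /^0+$/;

/** Returns the incoming trace-id when the header is well-formed, otherwise a fresh one. */
function resolveTraceId(incoming: string | undefined): string {
  const match = incoming ? TRACEPARENT_RE.exec(incoming) : null;
  const traceId = match?.[1];
  const parentId = match?.[2];
  // All-zero trace-id or parent-id is invalid per the spec, so treat it as absent.
  if (traceId && parentId && !ALL_ZERO_RE.test(traceId) && !ALL_ZERO_RE.test(parentId)) {
    return traceId;
  }
  return randomBytes(16).toString('hex');
}

/** Builds the outgoing header with a new span id and the sampled flag, as the middleware does. */
function nextTraceparent(traceId: string): string {
  return `00-${traceId}-${randomBytes(8).toString('hex')}-01`;
}
```

Inside `tracingMiddleware`, `req.traceId = resolveTraceId(incoming)` followed by `req.traceparent = nextTraceparent(req.traceId)` would slot in without changing the request shape declared above.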

@@ -4,7 +4,7 @@
* A real Express app + in-memory SQLite DB per test suite. * A real Express app + in-memory SQLite DB per test suite.
* Auth and admin middleware are mocked so we can focus on route logic. * Auth and admin middleware are mocked so we can focus on route logic.
*/ */
import { describe, it, expect, vi, beforeAll } from 'vitest'; import { describe, it, expect, vi, beforeAll, afterEach } from 'vitest';
import express from 'express'; import express from 'express';
import * as http from 'http'; import * as http from 'http';
import { makeTestDb } from '../../test/db.js'; import { makeTestDb } from '../../test/db.js';
@@ -385,16 +385,126 @@ describe('GET /api/admin/events', () => {
}); });
}); });
// ---------------------------------------------------------------------------
// Health endpoint — mock fetch so tests don't depend on running services.
// ---------------------------------------------------------------------------
describe('GET /api/admin/health', () => { describe('GET /api/admin/health', () => {
it('returns 200 with ok, services array, and checkedAt', async () => { const EXPECTED_HTTP_SERVICES = ['api', 'ml-serving', 'mlflow', 'airflow'] as const;
const EXPECTED_INTERNAL = ['sqlite', 'event-bus'] as const;
const VALID_STATUSES = new Set(['ok', 'degraded', 'down']);
type ServiceRow = { name: string; status: string; latencyMs: number };
type HealthBody = { ok: boolean; services: ServiceRow[]; checkedAt: string };
function mockFetch(upServices: Set<string>) {
// Resolve service name by port (matches defaults in config.ts).
// Up services return HTTP 200; absent ones throw (simulates connection refused → 'down').
vi.stubGlobal('fetch', async (url: string) => {
const s = String(url);
let name: string;
if (s.includes(':8000')) name = 'ml-serving';
else if (s.includes(':5000')) name = 'mlflow';
else if (s.includes(':8080')) name = 'airflow';
else name = 'api';
if (!upServices.has(name)) throw new Error(`ECONNREFUSED ${name}`);
return { ok: true, json: async () => ({ ok: true, status: 'healthy' }) };
});
}
afterEach(() => vi.unstubAllGlobals());
it('shape: 200, typed fields, all expected services present', async () => {
mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
const { server, call } = await startServer(buildApp()); const { server, call } = await startServer(buildApp());
try { try {
const { status, body } = await call('GET', '/api/admin/health'); const { status, body } = await call('GET', '/api/admin/health');
const b = body as { ok: boolean; services: { name: string; status: string }[]; checkedAt: string }; const b = body as HealthBody;
expect(status).toBe(200); expect(status).toBe(200);
expect(typeof b.ok).toBe('boolean'); expect(typeof b.ok).toBe('boolean');
expect(Array.isArray(b.services)).toBe(true); expect(Array.isArray(b.services)).toBe(true);
expect(typeof b.checkedAt).toBe('string'); expect(typeof b.checkedAt).toBe('string');
expect(new Date(b.checkedAt).getTime()).toBeGreaterThan(0);
const names = b.services.map((s) => s.name);
for (const svc of [...EXPECTED_HTTP_SERVICES, ...EXPECTED_INTERNAL]) {
expect(names).toContain(svc);
}
for (const svc of b.services) {
expect(VALID_STATUSES).toContain(svc.status);
expect(typeof svc.latencyMs).toBe('number');
}
} finally {
server.close();
}
});
it('ok=true when all HTTP services respond 200', async () => {
mockFetch(new Set(['api', 'ml-serving', 'mlflow', 'airflow']));
const { server, call } = await startServer(buildApp());
try {
const { body } = await call('GET', '/api/admin/health');
const b = body as HealthBody;
for (const name of EXPECTED_HTTP_SERVICES) {
const svc = b.services.find((s) => s.name === name);
expect(svc?.status, `${name} should be ok`).toBe('ok');
}
expect(b.ok).toBe(true);
} finally {
server.close();
}
});
it('ml-serving=down and ok=false when ml-serving is unreachable', async () => {
mockFetch(new Set(['api', 'mlflow', 'airflow'])); // ml-serving absent
const { server, call } = await startServer(buildApp());
try {
const { body } = await call('GET', '/api/admin/health');
const b = body as HealthBody;
const mlSvc = b.services.find((s) => s.name === 'ml-serving');
expect(mlSvc?.status).toBe('down');
expect(b.ok).toBe(false);
} finally {
server.close();
}
});
it('airflow=down and ok=false when airflow is unreachable', async () => {
mockFetch(new Set(['api', 'ml-serving', 'mlflow'])); // airflow absent
const { server, call } = await startServer(buildApp());
try {
const { body } = await call('GET', '/api/admin/health');
const b = body as HealthBody;
const svc = b.services.find((s) => s.name === 'airflow');
expect(svc?.status).toBe('down');
expect(b.ok).toBe(false);
} finally {
server.close();
}
});
it('mlflow=down and ok=false when mlflow is unreachable', async () => {
mockFetch(new Set(['api', 'ml-serving', 'airflow'])); // mlflow absent
const { server, call } = await startServer(buildApp());
try {
const { body } = await call('GET', '/api/admin/health');
const b = body as HealthBody;
const svc = b.services.find((s) => s.name === 'mlflow');
expect(svc?.status).toBe('down');
expect(b.ok).toBe(false);
} finally {
server.close();
}
});
it('sqlite and event-bus are always present regardless of HTTP service status', async () => {
mockFetch(new Set()); // all HTTP services down
const { server, call } = await startServer(buildApp());
try {
const { body } = await call('GET', '/api/admin/health');
const b = body as HealthBody;
expect(b.services.find((s) => s.name === 'sqlite')?.status).toBe('ok');
expect(b.services.find((s) => s.name === 'event-bus')?.status).toBe('ok');
} finally { } finally {
server.close(); server.close();
} }

@@ -1,4 +1,5 @@
import { type Router as ExpressRouter, Router, Response } from 'express'; import { type Router as ExpressRouter, Router, Response, type Request } from 'express';
import { logger } from '../logger.js';
import { db, rawSqlite } from '../db/index.js'; import { db, rawSqlite } from '../db/index.js';
import { import {
users, users,
@@ -523,16 +524,24 @@ router.get('/data-quality', async (req: AuthenticatedRequest, res: Response) =>
// Fan-out to all subsystem /health endpoints. // Fan-out to all subsystem /health endpoints.
// --------------------------------------------------------------------------- // ---------------------------------------------------------------------------
router.get('/health', async (_req: AuthenticatedRequest, res: Response) => { router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
const checks: Array<{ name: string; url: string }> = [ const airflowAuth = Buffer.from(`${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`).toString('base64');
{ name: 'api', url: `http://localhost:${process.env.PORT ?? 3001}/health` },
const checks: Array<{ name: string; url: string; headers?: Record<string, string> }> = [
{ name: 'api', url: `http://localhost:${config.PORT}/health` },
{ name: 'ml-serving', url: `${config.ML_SERVING_URL}/health` }, { name: 'ml-serving', url: `${config.ML_SERVING_URL}/health` },
{ name: 'mlflow', url: `${config.MLFLOW_URL}/health` },
{ name: 'airflow', url: `${config.AIRFLOW_URL}/api/v1/health`,
headers: { Authorization: `Basic ${airflowAuth}` } },
]; ];
const results = await Promise.allSettled( const results = await Promise.allSettled(
checks.map(async ({ name, url }) => { checks.map(async ({ name, url, headers }) => {
const t0 = Date.now(); const t0 = Date.now();
try { try {
const r = await fetch(url, { signal: AbortSignal.timeout(3000) }); const r = await fetch(url, {
headers,
signal: AbortSignal.timeout(3000),
});
return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 }; return { name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
} catch { } catch {
return { name, status: 'down', latencyMs: Date.now() - t0 }; return { name, status: 'down', latencyMs: Date.now() - t0 };
@@ -548,15 +557,12 @@ router.get('/health', async (_req: AuthenticatedRequest, res: Response) => {
dbStatus = 'down'; dbStatus = 'down';
} }
// Event bus: always ok if process is alive
const eventBusStatus = 'ok';
const services = results.map((r) => const services = results.map((r) =>
r.status === 'fulfilled' ? r.value : { name: 'unknown', status: 'down', latencyMs: 0 }, r.status === 'fulfilled' ? r.value : { name: 'unknown', status: 'down', latencyMs: 0 },
); );
services.push({ name: 'sqlite', status: dbStatus, latencyMs: 0 }); services.push({ name: 'sqlite', status: dbStatus, latencyMs: 0 });
services.push({ name: 'event-bus', status: eventBusStatus, latencyMs: 0 }); services.push({ name: 'event-bus', status: 'ok', latencyMs: 0 });
const allOk = services.every((s) => s.status === 'ok'); const allOk = services.every((s) => s.status === 'ok');
res.json({ ok: allOk, services, checkedAt: new Date().toISOString() }); res.json({ ok: allOk, services, checkedAt: new Date().toISOString() });
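
The fan-out in this hunk can be read as a small pure function. The sketch below is an illustration rather than the route's code: `Check`, `FetchLike`, `probe` and `checkServices` are hypothetical names, and the fetch function is injected so a fake can stand in for real services. Because each probe catches its own errors, plain `Promise.all` is enough and every row keeps its service name instead of falling back to `'unknown'`.

```typescript
type Check = { name: string; url: string; headers?: Record<string, string> };
type ServiceRow = { name: string; status: 'ok' | 'degraded' | 'down'; latencyMs: number };
// Minimal fetch shape (an assumption) so tests can inject a fake.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string>; signal?: AbortSignal },
) => Promise<{ ok: boolean }>;

/** Probes one service: 2xx is 'ok', any other response is 'degraded', a thrown error or timeout is 'down'. */
async function probe(check: Check, fetchFn: FetchLike, timeoutMs = 3000): Promise<ServiceRow> {
  const t0 = Date.now();
  try {
    const r = await fetchFn(check.url, {
      headers: check.headers,
      signal: AbortSignal.timeout(timeoutMs),
    });
    return { name: check.name, status: r.ok ? 'ok' : 'degraded', latencyMs: Date.now() - t0 };
  } catch {
    return { name: check.name, status: 'down', latencyMs: Date.now() - t0 };
  }
}

/** Runs all probes concurrently; `ok` is true only when every service reports 'ok'. */
async function checkServices(
  checks: Check[],
  fetchFn: FetchLike,
): Promise<{ ok: boolean; services: ServiceRow[] }> {
  const services = await Promise.all(checks.map((c) => probe(c, fetchFn)));
  return { ok: services.every((s) => s.status === 'ok'), services };
}
```

With the real global `fetch` passed as `fetchFn`, the Airflow entry would carry the `Authorization: Basic ...` header built from `AIRFLOW_API_USER` and `AIRFLOW_API_PASSWORD`, as in the hunk above.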
@@ -699,22 +705,21 @@ router.delete('/saved-queries/:id', async (req: AuthenticatedRequest, res: Respo
// --------------------------------------------------------------------------- // ---------------------------------------------------------------------------
// POST /api/admin/simulate/start // POST /api/admin/simulate/start
// Spawn ml/experiments/sim/runner.py in the background; return run_id. // Trigger an Airflow DAG run (bandit_sim). Falls back to a local subprocess
// when Airflow is not configured or the trigger fails, so local dev still works.
// --------------------------------------------------------------------------- // ---------------------------------------------------------------------------
router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response) => { router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response) => {
const { const {
nUsers = 5, nUsers = 5,
nRounds = 20, nRounds = 20,
tasksPerRound = 8, tasksPerRound = 8,
useLlm = false,
judgeMode = 'rule', judgeMode = 'rule',
policies = ['linucb-v1', 'egreedy-v1'], policies = ['linucb-v1', 'egreedy-v1'],
} = req.body as { } = req.body as {
nUsers?: number; nUsers?: number;
nRounds?: number; nRounds?: number;
tasksPerRound?: number; tasksPerRound?: number;
useLlm?: boolean; judgeMode?: 'rule' | 'llm';
judgeMode?: 'rule' | 'llm' | 'claude-code';
policies?: string[]; policies?: string[];
}; };
@@ -733,17 +738,69 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
nUsers, nUsers,
nRounds, nRounds,
tasksPerRound, tasksPerRound,
useLlm, useLlm: judgeMode === 'llm',
judgeMode,
nPolicies: policies.length,
status: 'running', status: 'running',
createdAt: now, createdAt: now,
}); });
// ── Try Airflow first ────────────────────────────────────────────────────
if (config.AIRFLOW_URL && config.INTERNAL_API_TOKEN) {
try {
const airflowAuth = Buffer.from(
`${config.AIRFLOW_API_USER}:${config.AIRFLOW_API_PASSWORD}`,
).toString('base64');
const dagRes = await fetch(
`${config.AIRFLOW_URL}/api/v1/dags/bandit_sim/dagRuns`,
{
method: 'POST',
headers: {
'Content-Type': 'application/json',
Authorization: `Basic ${airflowAuth}`,
},
body: JSON.stringify({
conf: {
sim_run_id: id,
n_users: nUsers,
n_rounds: nRounds,
tasks_per_round: tasksPerRound,
policies,
judge_mode: judgeMode,
ml_url: config.ML_SERVING_URL,
mlflow_url: config.MLFLOW_URL,
callback_url: `${config.API_BASE_URL}/api/admin/simulate/${id}/complete`,
internal_token: config.INTERNAL_API_TOKEN,
},
}),
signal: AbortSignal.timeout(5000),
},
);
if (dagRes.ok) {
const dagBody = await dagRes.json() as { dag_run_id: string };
await db
.update(simRuns)
.set({ airflowDagRunId: dagBody.dag_run_id })
.where(eq(simRuns.id, id));
res.json({ id, status: 'running', airflow_dag_run_id: dagBody.dag_run_id });
return;
}
logger.warn({ status: dagRes.status }, 'sim: Airflow trigger failed, falling back to subprocess');
} catch (err) {
logger.warn({ err }, 'sim: Airflow unreachable, falling back to subprocess');
}
}
// ── Subprocess fallback (local dev / Airflow not configured) ────────────
const runnerPath = resolve(__dirname, '../../../../ml/experiments/sim/runner.py'); const runnerPath = resolve(__dirname, '../../../../ml/experiments/sim/runner.py');
const venvPython = resolve(__dirname, '../../../../ml/serving/.venv/bin/python'); const venvPython = resolve(__dirname, '../../../../ml/serving/.venv/bin/python');
const pythonBin = existsSync(venvPython) ? venvPython : 'python3'; const pythonBin = existsSync(venvPython) ? venvPython : 'python3';
const outPath = `/tmp/oo-sim-${id}.json`; const outPath = `/tmp/oo-sim-${id}.json`;
const args = [ const child = spawn(pythonBin, [
runnerPath, runnerPath,
'--n-users', String(nUsers), '--n-users', String(nUsers),
'--n-rounds', String(nRounds), '--n-rounds', String(nRounds),
@@ -751,32 +808,22 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
'--ml-url', config.ML_SERVING_URL, '--ml-url', config.ML_SERVING_URL,
'--policies', ...policies, '--policies', ...policies,
'--out', outPath, '--out', outPath,
'--judge', judgeMode === 'llm' ? 'llm' : judgeMode === 'claude-code' ? 'rule' : 'rule', '--judge', judgeMode,
// claude-code mode isn't auto-runnable from the API (requires human in the loop) '--mlflow-url', config.MLFLOW_URL,
// it falls back to rule judge when triggered from the panel '--mlflow-experiment', 'bandit_simulation',
]; ], { stdio: ['ignore', 'pipe', 'pipe'] });
const child = spawn(pythonBin, args, { stdio: ['ignore', 'pipe', 'pipe'] }); if (child.pid) _simProcesses.set(id, { pid: child.pid, startedAt: now });
if (child.pid) {
_simProcesses.set(id, { pid: child.pid, startedAt: now });
}
// Without this listener, a spawn failure (ENOENT when python3 is absent
// — e.g. in the alpine api container) would emit an unhandled 'error' event
// and crash the whole API process.
child.on('error', async (err) => { child.on('error', async (err) => {
console.error('[sim] spawn error', err); logger.error({ err }, 'sim: spawn error');
_simProcesses.delete(id); _simProcesses.delete(id);
await db await db.update(simRuns)
.update(simRuns)
.set({ status: 'failed', finishedAt: new Date().toISOString() }) .set({ status: 'failed', finishedAt: new Date().toISOString() })
.where(eq(simRuns.id, id)); .where(eq(simRuns.id, id));
}); });
// Capture stderr for debugging child.stderr?.on('data', (d: Buffer) => logger.debug({ stderr: d.toString() }, 'sim stderr'));
const stderrLines: string[] = [];
child.stderr?.on('data', (d: Buffer) => stderrLines.push(d.toString()));
child.on('exit', async (code) => { child.on('exit', async (code) => {
_simProcesses.delete(id); _simProcesses.delete(id);
@@ -785,8 +832,6 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
if (code === 0 && existsSync(outPath)) { if (code === 0 && existsSync(outPath)) {
try { try {
const raw = JSON.parse(readFileSync(outPath, 'utf-8')); const raw = JSON.parse(readFileSync(outPath, 'utf-8'));
// Bulk-insert sim events
const eventRows = (raw.events ?? []).map((ev: Record<string, unknown>) => ({ const eventRows = (raw.events ?? []).map((ev: Record<string, unknown>) => ({
id: nanoid(), id: nanoid(),
runId: id, runId: id,
@@ -804,21 +849,19 @@ router.post('/simulate/start', async (req: AuthenticatedRequest, res: Response)
dayOfWeek: Number(ev.day_of_week), dayOfWeek: Number(ev.day_of_week),
createdAt: now, createdAt: now,
})); }));
for (const row of eventRows) { for (const row of eventRows) {
await db.insert(simEvents).values(row).catch(() => {}); await db.insert(simEvents).values(row).catch(() => {});
} }
await db.update(simRuns).set({ await db.update(simRuns).set({
status: 'done', status: 'done',
summaryJson: JSON.stringify(raw.summary), summaryJson: JSON.stringify(raw.summary),
winner: raw.winner, winner: raw.winner,
personaBreakdownJson: JSON.stringify(raw.persona_breakdown), personaBreakdownJson: JSON.stringify(raw.persona_breakdown),
mlflowRunId: raw.mlflow_run_id ?? null,
finishedAt, finishedAt,
}).where(eq(simRuns.id, id)); }).where(eq(simRuns.id, id));
try { unlinkSync(outPath); } catch { /* ignore */ } try { unlinkSync(outPath); } catch { /* ignore */ }
} catch (e) { } catch {
await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id)); await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
} }
} else { } else {
@@ -863,4 +906,68 @@ router.get('/simulate/:id', async (req: AuthenticatedRequest, res: Response) =>
res.json({ run: { ...run, isRunning }, events }); res.json({ run: { ...run, isRunning }, events });
}); });
export { router as adminRouter }; // ---------------------------------------------------------------------------
// internalRouter — no session auth; only INTERNAL_API_TOKEN header check.
// Mounted separately in index.ts at /api/admin to avoid router.use() auth.
// ---------------------------------------------------------------------------
const internalRouter: ExpressRouter = Router();
internalRouter.post('/simulate/:id/complete', async (req: Request, res: Response) => {
const token = req.headers['x-internal-token'];
if (!config.INTERNAL_API_TOKEN || token !== config.INTERNAL_API_TOKEN) {
res.status(401).json({ error: 'Unauthorized' });
return;
}
const { id } = req.params as { id: string };
const { summary, winner, persona_breakdown, events: rawEvents, mlflow_run_id } =
req.body as {
summary: Record<string, unknown>;
winner: string;
persona_breakdown: Record<string, unknown>;
events: Record<string, unknown>[];
mlflow_run_id?: string;
};
const finishedAt = new Date().toISOString();
const now = finishedAt;
try {
const eventRows = (rawEvents ?? []).map((ev) => ({
id: nanoid(),
runId: id,
round: Number(ev['round']),
userId: String(ev['user_id']),
persona: String(ev['persona']),
policy: String(ev['policy']),
tipContent: String(ev['tip_content']),
priority: Number(ev['priority']),
isOverdue: Boolean(ev['is_overdue']),
action: String(ev['action']),
dwellMs: ev['dwell_ms'] != null ? Number(ev['dwell_ms']) : null,
rewardMilli: Math.round(Number(ev['reward']) * 1000),
hour: Number(ev['hour']),
dayOfWeek: Number(ev['day_of_week']),
createdAt: now,
}));
for (const row of eventRows) {
await db.insert(simEvents).values(row).catch(() => {});
}
await db.update(simRuns).set({
status: 'done',
summaryJson: JSON.stringify(summary),
winner,
personaBreakdownJson: JSON.stringify(persona_breakdown),
mlflowRunId: mlflow_run_id ?? null,
finishedAt,
}).where(eq(simRuns.id, id));
res.json({ ok: true });
} catch (err) {
logger.error({ err }, 'sim: complete callback failed');
await db.update(simRuns).set({ status: 'failed', finishedAt }).where(eq(simRuns.id, id));
res.status(500).json({ error: 'Failed to store results' });
}
});
export { router as adminRouter, internalRouter as adminInternalRouter };
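
Reviewer note: the header check at the top of the `/simulate/:id/complete` handler can be read as a small pure predicate. The sketch below is illustrative only; `isInternalRequestAuthorized` and `HeaderValue` are hypothetical names that do not appear in the diff. It assumes the intent is to reject every request when `INTERNAL_API_TOKEN` is unset, which is what the `!config.INTERNAL_API_TOKEN` guard suggests.

```typescript
// Hypothetical helper, not part of the diff above.
// Node exposes a request header as string, string[] (repeated header), or undefined.
type HeaderValue = string | string[] | undefined;

/** Returns true only when a token is configured and the header matches it exactly. */
function isInternalRequestAuthorized(
  configuredToken: string | undefined,
  headerValue: HeaderValue,
): boolean {
  // An unset or empty token must reject everything, never allow everything.
  if (!configuredToken) return false;
  // A repeated or missing header is treated as unauthorized.
  if (typeof headerValue !== 'string') return false;
  return headerValue === configuredToken;
}
```

Keeping the check pure makes the "token not configured" and "repeated header" cases easy to cover in a unit test without starting Express.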


@@ -5,6 +5,7 @@ import { db } from '../db/index.js';
import { users, sessions } from '../db/schema.js'; import { users, sessions } from '../db/schema.js';
import { eq } from 'drizzle-orm'; import { eq } from 'drizzle-orm';
import { config } from '../config.js'; import { config } from '../config.js';
import { logger } from '../logger.js';
const router: ExpressRouter = Router(); const router: ExpressRouter = Router();
@@ -36,7 +37,7 @@ router.get('/login', async (req: Request, res: Response) => {
setTimeout(() => pendingStates.delete(state), 10 * 60 * 1000); setTimeout(() => pendingStates.delete(state), 10 * 60 * 1000);
const redirectUri = `${config.API_BASE_URL}/api/auth/callback`; const redirectUri = `${config.API_BASE_URL}/api/auth/callback`;
console.log('[auth] redirect_uri sent to Google:', redirectUri); logger.info({ redirectUri }, 'auth: redirect_uri');
const authUrl = client.buildAuthorizationUrl(cfg, { const authUrl = client.buildAuthorizationUrl(cfg, {
redirect_uri: redirectUri, redirect_uri: redirectUri,
scope: 'openid email profile', scope: 'openid email profile',
@@ -72,7 +73,7 @@ router.get('/callback', async (req: Request, res: Response) => {
expectedState: state, expectedState: state,
}); });
} catch (err) { } catch (err) {
console.error('OAuth callback error', err); logger.error({ err }, 'auth: OAuth callback error');
res.status(400).json({ error: 'OAuth error' }); res.status(400).json({ error: 'OAuth error' });
return; return;
} }
@@ -123,6 +124,45 @@ router.get('/callback', async (req: Request, res: Response) => {
.redirect(`${config.WEB_BASE_URL}${pending.redirectTo}`); .redirect(`${config.WEB_BASE_URL}${pending.redirectTo}`);
}); });
/**
* POST /api/auth/token
* Exchange the static ADMIN_TOKEN for a session cookie.
* Finds the first admin user in the DB; rejects if ADMIN_TOKEN is not configured.
*/
router.post('/token', async (req: Request, res: Response) => {
const { token } = req.body as { token?: string };
if (!config.ADMIN_TOKEN || !token || token !== config.ADMIN_TOKEN) {
res.status(401).json({ error: 'Invalid token' });
return;
}
const [adminUser] = await db
.select()
.from(users)
.where(eq(users.role, 'admin'))
.limit(1);
if (!adminUser) {
res.status(403).json({ error: 'No admin user exists' });
return;
}
const sid = nanoid(32);
const now = new Date().toISOString();
const expiresAt = new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString();
await db.insert(sessions).values({ id: sid, userId: adminUser.id, expiresAt, createdAt: now });
res
.cookie('sid', sid, {
httpOnly: true,
secure: config.NODE_ENV === 'production',
sameSite: 'lax',
expires: new Date(expiresAt),
path: '/',
})
.json({ ok: true });
});
/** POST /api/auth/logout */ /** POST /api/auth/logout */
router.post('/logout', async (req: Request, res: Response) => { router.post('/logout', async (req: Request, res: Response) => {
const sid = req.cookies?.sid as string | undefined; const sid = req.cookies?.sid as string | undefined;


@@ -1,5 +1,6 @@
import { type Router as ExpressRouter, Router, Response } from 'express'; import { type Router as ExpressRouter, Router, Response } from 'express';
import { nanoid } from 'nanoid'; import { nanoid } from 'nanoid';
import { logger } from '../logger.js';
import { db } from '../db/index.js'; import { db } from '../db/index.js';
import { integrationTokens, tipFeedback, tipViews, tipScores } from '../db/schema.js'; import { integrationTokens, tipFeedback, tipViews, tipScores } from '../db/schema.js';
import { eq, and, desc } from 'drizzle-orm'; import { eq, and, desc } from 'drizzle-orm';
@@ -47,7 +48,8 @@ export const _clearCandidateCacheForTests = () => {
// Shadow-policy registry // Shadow-policy registry
// --------------------------------------------------------------------------- // ---------------------------------------------------------------------------
const shadowPolicies = new Map<string, { active: boolean }>([ const shadowPolicies = new Map<string, { active: boolean }>([
// egreedy-v2 (D=12, profile features) — disabled until sim gate per ADR-0012 // egreedy-v2 promoted to active policy (ADR-0012). Shadow entry kept for
// rollback toggle; leave disabled in normal operation.
['egreedy-v2-shadow', { active: false }], ['egreedy-v2-shadow', { active: false }],
]); ]);
@@ -84,6 +86,7 @@ async function remotePolicy(
userId: string, userId: string,
tasks: TipCandidate[], tasks: TipCandidate[],
profile: Profile, profile: Profile,
traceparent?: string,
): Promise<{ tipId: string; score: number; policy: string } | null> { ): Promise<{ tipId: string; score: number; policy: string } | null> {
const hour = new Date().getHours(); const hour = new Date().getHours();
const dayOfWeek = new Date().getDay(); const dayOfWeek = new Date().getDay();
@@ -101,17 +104,16 @@ async function remotePolicy(
profile_features: profile, profile_features: profile,
}; };
// Active policy: egreedy-v1 (selected over linucb-v1 after offline sim — ADR-0007)
try { try {
const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy`, { const res = await fetch(`${config.ML_SERVING_URL}/score/egreedy/v2`, {
method: 'POST', method: 'POST',
headers: { 'Content-Type': 'application/json' }, headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
body: JSON.stringify(body), body: JSON.stringify(body),
signal: AbortSignal.timeout(3000), signal: AbortSignal.timeout(3000),
}); });
if (!res.ok) return null; if (!res.ok) return null;
const data = (await res.json()) as { tip_id: string; score: number }; const data = (await res.json()) as { tip_id: string; score: number };
return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v1' }; return { tipId: data.tip_id, score: data.score, policy: 'egreedy-v2' };
} catch { } catch {
return null; return null;
} }
@@ -145,6 +147,7 @@ async function fetchLlmCandidates(
dayOfWeek: number, dayOfWeek: number,
promptVersion: string | null, promptVersion: string | null,
profile: Profile, profile: Profile,
traceparent?: string,
): Promise<LlmGenerateResult> { ): Promise<LlmGenerateResult> {
try { try {
const tasks = signals.slice(0, 10).map((s) => ({ const tasks = signals.slice(0, 10).map((s) => ({
@@ -155,7 +158,7 @@ async function fetchLlmCandidates(
})); }));
const res = await fetch(`${config.ML_SERVING_URL}/generate`, { const res = await fetch(`${config.ML_SERVING_URL}/generate`, {
method: 'POST', method: 'POST',
headers: { 'Content-Type': 'application/json' }, headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
body: JSON.stringify({ body: JSON.stringify({
user_id: userId, user_id: userId,
context: { tasks, hour_of_day: hour, day_of_week: dayOfWeek }, context: { tasks, hour_of_day: hour, day_of_week: dayOfWeek },
@@ -225,6 +228,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
dayOfWeek, dayOfWeek,
requestedPromptVersion, requestedPromptVersion,
profile, profile,
req.traceparent,
); );
const allCandidates: TipCandidate[] = [...signalCandidates, ...llmResult.candidates]; const allCandidates: TipCandidate[] = [...signalCandidates, ...llmResult.candidates];
@@ -239,7 +243,7 @@ router.post('/recommend', requireAuth, async (req: AuthenticatedRequest, res: Re
const t0 = Date.now(); const t0 = Date.now();
// Stage 2: score — egreedy bandit with random fallback // Stage 2: score — egreedy bandit with random fallback
const scored = await remotePolicy(req.userId!, allCandidates, profile); const scored = await remotePolicy(req.userId!, allCandidates, profile, req.traceparent);
const latencyMs = Date.now() - t0; const latencyMs = Date.now() - t0;
const tip = scored const tip = scored
? (allCandidates.find((t) => t.id === scored.tipId) ?? randomPolicy(allCandidates)) ? (allCandidates.find((t) => t.id === scored.tipId) ?? randomPolicy(allCandidates))
@@ -371,6 +375,8 @@ async function sendRewardWithRetry(
tipId: string, tipId: string,
reward: number, reward: number,
features: TipCandidate['features'], features: TipCandidate['features'],
profile: Profile,
traceparent?: string,
): Promise<void> { ): Promise<void> {
const body = JSON.stringify({ const body = JSON.stringify({
user_id: userId, user_id: userId,
@@ -378,13 +384,14 @@ async function sendRewardWithRetry(
reward, reward,
features, features,
day_of_week: new Date().getDay(), day_of_week: new Date().getDay(),
profile_features: profile,
}); });
for (let attempt = 1; attempt <= 3; attempt++) { for (let attempt = 1; attempt <= 3; attempt++) {
try { try {
const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy`, { const res = await fetch(`${config.ML_SERVING_URL}/reward/egreedy/v2`, {
method: 'POST', method: 'POST',
headers: { 'Content-Type': 'application/json' }, headers: { 'Content-Type': 'application/json', ...(traceparent ? { traceparent } : {}) },
body, body,
signal: AbortSignal.timeout(3000), signal: AbortSignal.timeout(3000),
}); });
@@ -392,7 +399,7 @@ async function sendRewardWithRetry(
throw new Error(`HTTP ${res.status}`); throw new Error(`HTTP ${res.status}`);
} catch (err: any) { } catch (err: any) {
if (attempt === 3) { if (attempt === 3) {
console.error(`[reward] failed after 3 attempts for tip ${tipId}: ${err.message}`); logger.error({ tipId, err }, 'reward: failed after 3 attempts');
bus.publish('signals.tip.reward_failed', { bus.publish('signals.tip.reward_failed', {
userId, userId,
tipId, tipId,
@@ -463,7 +470,9 @@ router.post('/tip/:id/feedback', requireAuth, async (req: AuthenticatedRequest,
}); });
if (candidate) { if (candidate) {
sendRewardWithRetry(req.userId!, tipId, reward, candidate.features); // Re-fetch profile for the v2 ridge update; TTL cache makes this near-instant.
const profile = await getProfile(req.userId!);
sendRewardWithRetry(req.userId!, tipId, reward, candidate.features, profile, req.traceparent);
} }
// Delegate action to the owning signal source (e.g. mark done in Todoist) // Delegate action to the owning signal source (e.g. mark done in Todoist)
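
Reviewer note: the three `fetch` calls in this file now build their headers with the same conditional spread. A minimal sketch of that pattern as a shared helper follows; `mlServingHeaders` is a hypothetical name, not something the diff introduces.

```typescript
// Hypothetical helper mirroring the inline header objects in the diff above.
function mlServingHeaders(traceparent?: string): Record<string, string> {
  return {
    'Content-Type': 'application/json',
    // Only forward the trace context header when the incoming request carried one.
    ...(traceparent ? { traceparent } : {}),
  };
}

// Usage: with and without an upstream trace context.
const withTrace = mlServingHeaders('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const withoutTrace = mlServingHeaders();
```

Spreading an empty object when `traceparent` is absent keeps the header out of the request entirely, rather than sending it with an `undefined` value.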


@@ -8,6 +8,11 @@
*/ */
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest'; import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
vi.mock('../../logger.js', () => ({
logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), fatal: vi.fn() },
}));
import { logger } from '../../logger.js';
// ── mock the drizzle query chain: db.select(...).from(...).where(...) ──────── // ── mock the drizzle query chain: db.select(...).from(...).where(...) ────────
let users: { userId: string }[] = []; let users: { userId: string }[] = [];
const whereMock = vi.fn(async () => users); const whereMock = vi.fn(async () => users);
@@ -35,6 +40,7 @@ beforeEach(() => {
whereMock.mockClear(); whereMock.mockClear();
fromMock.mockClear(); fromMock.mockClear();
selectMock.mockClear(); selectMock.mockClear();
vi.clearAllMocks();
vi.useFakeTimers(); vi.useFakeTimers();
}); });
@@ -102,8 +108,6 @@ describe('startTodoistSyncScheduler', () => {
if (id === 'bad') throw new Error('todoist 401'); if (id === 'bad') throw new Error('todoist 401');
return []; return [];
}); });
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {});
startTodoistSyncScheduler(60_000); startTodoistSyncScheduler(60_000);
await vi.advanceTimersByTimeAsync(10_001); await vi.advanceTimersByTimeAsync(10_001);
@@ -112,19 +116,27 @@ describe('startTodoistSyncScheduler', () => {
await Promise.resolve(); await Promise.resolve();
expect(fetchSignalsMock).toHaveBeenCalledTimes(3); expect(fetchSignalsMock).toHaveBeenCalledTimes(3);
expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('sync error'), expect.anything()); expect(logger.error).toHaveBeenCalledWith(
expect(logSpy).toHaveBeenCalledWith(expect.stringContaining('2 ok, 1 failed')); expect.objectContaining({ err: expect.anything() }),
'scheduler: sync error',
);
expect(logger.info).toHaveBeenCalledWith(
expect.objectContaining({ ok: 2, failed: 1 }),
'scheduler: todoist sync',
);
}); });
it('survives a db query failure — logs and skips the tick', async () => { it('survives a db query failure — logs and skips the tick', async () => {
const { startTodoistSyncScheduler } = await import('../scheduler.js'); const { startTodoistSyncScheduler } = await import('../scheduler.js');
whereMock.mockRejectedValueOnce(new Error('sqlite locked')); whereMock.mockRejectedValueOnce(new Error('sqlite locked'));
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
startTodoistSyncScheduler(60_000); startTodoistSyncScheduler(60_000);
await vi.advanceTimersByTimeAsync(10_001); await vi.advanceTimersByTimeAsync(10_001);
expect(fetchSignalsMock).not.toHaveBeenCalled(); expect(fetchSignalsMock).not.toHaveBeenCalled();
expect(errSpy).toHaveBeenCalledWith(expect.stringContaining('failed to query users')); expect(logger.error).toHaveBeenCalledWith(
expect.objectContaining({ err: expect.anything() }),
'scheduler: failed to query users',
);
}); });
}); });


@@ -1,4 +1,5 @@
import type { Signal, SignalSource } from '@oo/shared-types'; import type { Signal, SignalSource } from '@oo/shared-types';
import { logger } from '../logger.js';
/** /**
* Merges signals from all registered sources for a user. * Merges signals from all registered sources for a user.
@@ -24,7 +25,7 @@ export class SignalAggregator {
if (r.status === 'fulfilled') { if (r.status === 'fulfilled') {
signals.push(...r.value); signals.push(...r.value);
} else { } else {
console.error(`[aggregator] source '${this.sources[i].id}' failed:`, r.reason); logger.error({ sourceId: this.sources[i]!.id, err: r.reason }, 'aggregator: source failed');
} }
} }
return signals; return signals;


@@ -13,6 +13,7 @@ import { db } from '../db/index.js';
import { integrationTokens } from '../db/schema.js'; import { integrationTokens } from '../db/schema.js';
import { eq } from 'drizzle-orm'; import { eq } from 'drizzle-orm';
import { todoistSource } from './todoist.js'; import { todoistSource } from './todoist.js';
import { logger } from '../logger.js';
const DEFAULT_INTERVAL_MS = 15 * 60 * 1000; const DEFAULT_INTERVAL_MS = 15 * 60 * 1000;
@@ -25,7 +26,7 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
.from(integrationTokens) .from(integrationTokens)
.where(eq(integrationTokens.tokenStatus, 'active')); .where(eq(integrationTokens.tokenStatus, 'active'));
} catch (err: any) { } catch (err: any) {
console.error(`[scheduler] failed to query users: ${err.message}`); logger.error({ err }, 'scheduler: failed to query users');
return; return;
} }
@@ -39,10 +40,10 @@ export function startTodoistSyncScheduler(intervalMs = DEFAULT_INTERVAL_MS): Nod
let failed = 0; let failed = 0;
for (const r of results) { for (const r of results) {
if (r.status === 'fulfilled') ok++; if (r.status === 'fulfilled') ok++;
else { failed++; console.error(`[scheduler] sync error:`, r.reason); } else { failed++; logger.error({ err: r.reason }, 'scheduler: sync error'); }
} }
console.log(`[scheduler] todoist sync: ${ok} ok, ${failed} failed (${users.length} users)`); logger.info({ ok, failed, total: users.length }, 'scheduler: todoist sync');
} }
// Run once shortly after startup, then on interval // Run once shortly after startup, then on interval


@@ -3,6 +3,7 @@ import { db } from '../db/index.js';
import { integrationTokens } from '../db/schema.js'; import { integrationTokens } from '../db/schema.js';
import { eq, and } from 'drizzle-orm'; import { eq, and } from 'drizzle-orm';
import { bus } from '../events/bus.js'; import { bus } from '../events/bus.js';
import { logger } from '../logger.js';
const CACHE_TTL_MS = 30_000; const CACHE_TTL_MS = 30_000;
@@ -46,7 +47,7 @@ export class TodoistSignalSource implements SignalSource {
if (!res.ok) { if (!res.ok) {
if (res.status === 401) { if (res.status === 401) {
console.error(`[todoist] token expired for user ${userId}`); logger.warn({ userId }, 'todoist: token expired');
bus.publish('signals.integration.token_expired', { bus.publish('signals.integration.token_expired', {
userId, userId,
provider: 'todoist', provider: 'todoist',
@@ -88,7 +89,7 @@ export class TodoistSignalSource implements SignalSource {
}); });
this.cache.set(userId, { signals, fetchedAt: Date.now() }); this.cache.set(userId, { signals, fetchedAt: Date.now() });
bus.publish('signals.task.synced', { userId, count: signals.length, syncedAt: now }); bus.publish('signals.task.synced', { userId, source: 'todoist', count: signals.length, syncedAt: now });
return signals; return signals;
} }


@@ -2,30 +2,49 @@
Third-party connectors and the token vault. Third-party connectors and the token vault.
## Connector interface ## Signal source interface
Each connector implements `SignalSource` from `@oo/shared-types`:
```ts ```ts
interface Connector { interface SignalSource {
id: string // e.g. "todoist" readonly id: string // e.g. "todoist"
scopes: string[] // human-readable list shown in consent UI fetchSignals(userId: string): Promise<Signal[]> // returns normalized Signal[]
beginOAuth(user): Promise<{ redirectUrl, state }> act?(userId: string, signalId: string, action: string): Promise<void> // optional write-back
finishOAuth(code, state): Promise<StoredCredential>
fetchSignals(user, since?): AsyncIterable<NormalizedEvent>
// incremental-sync cursor (Todoist sync_token, webhook timestamps, etc.)
// stored in Credential.meta; the connector owns its shape.
act?(user, action): Promise<void> // optional write-back (complete task, etc.)
revoke(user): Promise<void> // REQUIRED: provider-side token revocation on disconnect
} }
``` ```
`SignalAggregator` (`services/api/src/signals/aggregator.ts`) fans out to all registered sources in parallel, isolating per-source failures.
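
The fan-out with per-source failure isolation can be sketched with `Promise.allSettled`, which the aggregator diff appears to rely on (it inspects `r.status` and `r.reason`). This is a simplified illustration, not the actual class: `fetchAllSignals`, the `onError` callback, and the trimmed-down `Signal` type are assumptions made for the example.

```typescript
// Simplified stand-ins for the types in @oo/shared-types (assumed shapes).
interface Signal { id: string; source: string }
interface SignalSource {
  readonly id: string;
  fetchSignals(userId: string): Promise<Signal[]>;
}

// Hypothetical free-function version of the aggregator's fan-out.
async function fetchAllSignals(
  sources: SignalSource[],
  userId: string,
  onError: (sourceId: string, reason: unknown) => void = () => {},
): Promise<Signal[]> {
  // allSettled never rejects, so one failing source cannot sink the others.
  const results = await Promise.allSettled(sources.map((s) => s.fetchSignals(userId)));
  const signals: Signal[] = [];
  results.forEach((r, i) => {
    if (r.status === 'fulfilled') signals.push(...r.value);
    else onError(sources[i]!.id, r.reason);
  });
  return signals;
}
```

A failing source contributes no signals and is reported through `onError`; the real implementation logs via `logger.error` instead.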
## Token vault ## Token vault
- Credentials encrypted at rest (libsodium sealed box); key from env/KMS. OAuth tokens stored in the `integration_tokens` SQLite table (`services/api/src/db/schema.ts`):
- Refresh handled transparently; consumers never see raw tokens.
- One row per `(user, provider)` with provider-specific `meta`.
## Roadmap | Column | Description |
|--------|-------------|
| `userId` | owner |
| `provider` | e.g. `todoist` |
| `accessToken` | OAuth access token (plain in dev; encrypted in prod via server secret store) |
| `tokenStatus` | `active` \| `needs_reconnect` |
- Phase 0: **Todoist** (OAuth2, read tasks, complete task). On a 401 from the upstream API, the connector marks the token `needs_reconnect` and publishes `signals.integration.token_expired` so the client can prompt re-auth.
- Phase 2: Google Calendar, Apple Health (web import), generic webhook ingress.
- Phase 5: public SDK so third parties can ship connectors. ## Routes
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/integrations` | List connected integrations for current user |
| `GET` | `/api/integrations/todoist/connect` | Start Todoist OAuth flow |
| `GET` | `/api/integrations/todoist/callback` | OAuth callback — exchange code, store token |
| `DELETE` | `/api/integrations/:provider` | Disconnect + delete token |
## Connectors
| Connector | Status | Signals produced |
|-----------|--------|-----------------|
| Todoist | Phase 1 — active | `task` signals (today + overdue); `done` write-back |
| Google Calendar | Phase 2 — planned | `event` signals |
## Extraction criteria
Extract to its own process when credential blast-radius isolation requires it (e.g. token vault with KMS-backed encryption needs to run in a hardened sidecar) or when connector volume justifies separate scaling.


@@ -1,29 +1,42 @@
# recommender # recommender
The core of oO. Takes a user + a context, returns **one** tip. The core of oO. Takes a user + context, returns **one** tip.
## Contract ## Contract
``` ```
POST /recommend POST /api/recommend
{ user_id, context?: { time, timezone, client, ... } } { } (user inferred from session)
→ { tip: { id, kind: "todo"|"advice", title, body, source, deep_link, meta } } → { tip: { id, content, source, kind, sourceId?, rationale?, createdAt } }
POST /feedback POST /api/tip/:id/feedback
{ user_id, tip_id, reaction: "done"|"snooze"|"dismiss", at } { action: "done"|"dismiss"|"snooze"|"helpful"|"not_helpful", dwellMs? }
→ { ok: true }
``` ```
## Internals (stable seams) ## Pipeline
- **Candidate sources** — pluggable async generators. v0: Todoist tasks via `integrations`. Later: advice library, calendar nudges, health prompts. 1. **Signals** — `SignalAggregator.fetchAll(userId)` fans out to all registered `SignalSource` implementations in parallel. Currently: `TodoistSignalSource`. Add a source via `aggregator.register(new MySource())`.
- **Feature assembler** — fills the `context` blob (inline in Phase 0; calls feature store from M1). Never inlined into policy code. 2. **LLM candidates** — `POST /generate` on `ml/serving` returns `TipCandidate[]` from the `tip-generator` LiteLLM alias.
- **Policy registry** — `Policy.pick(candidates, context) → tip`. Named entries: 3. **Scoring** — all candidates are sent to the `ml/serving` active policy (`POST /score/egreedy/v2`). Falls back to random if `ml/serving` is unreachable.
- `random` — v0 (Phase 0). 4. **Shadow policies** — shadow policies run alongside the active policy in the same request for offline comparison (ADR-0002). Currently `egreedy-v2` is the active policy; see the Policy registry for shadow status.
- `bandit.linucb.pooled` — v1 (Phase 1). **Global-then-personalize**: pooled features shared across users; per-user residual once data allows. 5. **Persistence** — `tipViews` + `tipScores` rows written on every serve; `tipFeedback` row on reaction.
- `remote` — delegates to `ml/serving` FastAPI scorer (Phase 1+). 6. **Reward delivery** — a reaction triggers `POST /reward/egreedy/v2` on `ml/serving` with the inferred reward value.
- **Shadow hook** — every request optionally runs N shadow policies in parallel and logs their picks + estimated rewards. Promotion from shadow → A/B → launch is a separate, deliberate step (ADR-0002).
- **TipInstance persistence** — every decision writes `context_snapshot` (features seen at decision time). This is what makes offline replay honest.
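
The scoring step's fallback behaviour can be sketched as a pure selection function. This is an illustration under assumptions: `chooseTip`, the injectable `rng`, and the trimmed `TipCandidate` and `Scored` shapes are hypothetical, and labelling the fallback pick with the `random` policy is an assumption rather than something the route is shown to do.

```typescript
// Simplified stand-ins for the route's types (assumed shapes).
interface TipCandidate { id: string; content: string }
interface Scored { tipId: string; score: number; policy: string }

// Hypothetical helper: prefer the scorer's pick, otherwise fall back to a random candidate.
function chooseTip(
  candidates: TipCandidate[],
  scored: Scored | null,
  rng: () => number = Math.random,
): { tip: TipCandidate; policy: string } {
  if (candidates.length === 0) throw new Error('no candidates');
  // The scorer may be unreachable (null) or return an id that is not in the candidate set.
  const match = scored ? candidates.find((c) => c.id === scored.tipId) : undefined;
  if (scored && match) return { tip: match, policy: scored.policy };
  // Clamp so an rng that returns exactly 1 cannot index past the end.
  const idx = Math.min(Math.floor(rng() * candidates.length), candidates.length - 1);
  return { tip: candidates[idx]!, policy: 'random' };
}
```

Injecting `rng` keeps the fallback deterministic in tests while defaulting to `Math.random` in normal use.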
## Phase 0 goal ## Signal normalization
`RandomPolicy` only. The service, contract, registry, shadow hook, and tip-instance persistence all exist; no ML yet. Signals carry `features: Record<string, number | boolean>` (bandit-ready) and `metadata: Record<string, unknown>` (source-specific raw fields). The bandit treats features as an opaque dict — sources own their feature names. See ADR-0009.
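
As a rough illustration of the shape described above, the sketch below shows a signal with a `features` dict and a `metadata` bag, plus one way a consumer might turn boolean features into numbers. `NormalizedSignal`, `toNumericFeatures`, and the feature names `priority` and `is_overdue` are assumptions for the example; the real `Signal` type lives in `@oo/shared-types`, and how `ml/serving` encodes features is not shown in this document.

```typescript
// Assumed shapes for illustration only.
type Features = Record<string, number | boolean>;

interface NormalizedSignal {
  id: string;
  source: string;
  features: Features;                 // bandit-ready, names owned by the source
  metadata: Record<string, unknown>;  // source-specific raw fields
}

// Hypothetical encoder: booleans become 0/1, numbers pass through unchanged.
function toNumericFeatures(features: Features): Record<string, number> {
  const out: Record<string, number> = {};
  for (const [name, value] of Object.entries(features)) {
    out[name] = typeof value === 'boolean' ? (value ? 1 : 0) : value;
  }
  return out;
}

const example: NormalizedSignal = {
  id: 'task-1',
  source: 'todoist',
  features: { priority: 4, is_overdue: true },
  metadata: { project: 'inbox' },
};
const encoded = toNumericFeatures(example.features);
```

Because the encoder never inspects feature names, a new source can add features without changes on the consumer side.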
## Policy registry
| Policy | Status | Notes |
|--------|--------|-------|
| `random` | Fallback | Used when ml/serving is unreachable |
| `egreedy-v1` | Shadow | d=7, ADR-0007 |
| `egreedy-v2` | **Active** | d=12 + profile features, ADR-0012 |
Shadow → active promotion requires offline sim + online agreement (ADR-0002).
## Extraction criteria
Extract to its own process when it becomes a scaling hotspot: when `POST /api/recommend` p99 latency exceeds its SLA, or when recommendation CPU load displaces API serving on a shared host.