feat(integrations): migrate google-health from Fit REST to Google Health API v4

The Google Fit REST API was closed to new sign-ups on 2024-05-01 and shuts down at the end of 2026. The closure surfaces as "Access blocked: this app's request is invalid" when starting the OAuth flow.

- Swap the 10 fitness.* OAuth scopes for the 3 googlehealth.*.readonly scopes (activity_and_fitness, health_metrics_and_measurements, sleep).
- Replace the fitness/v1 dataset:aggregate + sessions calls with health.googleapis.com/v4/users/me/dataTypes/{steps,total-calories,heart-rate,sleep}/dataPoints, filtered to today's window.
- Read the v4 DataPoint union defensively (the per-type schema is sparsely documented) and log the first raw sample at debug level so we can refine field paths after the first real OAuth.
- The output Signal contract is unchanged: agents and downstream consumers see the same steps/activity/heart_rate/sleep signals.

Cloud Console still needs: enable the Google Health API, add the 3 scopes to the consent screen, and add a test user (all googlehealth scopes are Restricted).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
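
The defensive read described in the commit message could look roughly like the sketch below. It is written in Python for brevity and is not the integration's actual code. Every field path in it (`steps.count`, `value.intVal`, `count`) is a hypothetical placeholder, not a documented Google Health API v4 field, and `dig`, `first_number` and `total_steps` are illustrative names. The idea is to try several candidate paths per data point, accept numbers that arrive as strings, and log the first raw sample at debug level so the paths can be corrected later.

```python
import logging
from typing import Any, Iterable, List, Optional

log = logging.getLogger("google_health_sketch")

# HYPOTHETICAL candidate paths; the real v4 field names are not confirmed.
STEP_PATHS = ("steps.count", "value.intVal", "count")


def dig(obj: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; None if any hop is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def first_number(point: dict, candidate_paths: Iterable[str]) -> Optional[float]:
    """Return the first numeric value found under any candidate path."""
    for path in candidate_paths:
        value = dig(point, path)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
        # JSON APIs sometimes serialize 64-bit integers as strings.
        if isinstance(value, str):
            try:
                return float(value)
            except ValueError:
                continue
    return None


def total_steps(points: List[dict]) -> float:
    """Sum step counts, skipping points whose shape is not recognized."""
    if points:
        log.debug("first raw data point: %r", points[0])
    total = 0.0
    for point in points:
        n = first_number(point, STEP_PATHS)
        if n is not None:
            total += n
    return total
```

Unrecognized points are skipped rather than raising, so one unexpected shape does not drop the whole signal; the debug log is what makes the skipped shapes discoverable.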
fix(infra): unblock docker builds for stars agent and web
2026-05-15 05:42:05 +00:00 · 2026-05-15 04:46:20 +00:00 · 2026-05-14 10:59:10 +00:00 · 2026-05-14 10:52:55 +00:00 · 2026-05-13 10:28:14 +00:00 · 2026-05-13 09:54:54 +00:00
140 changed files with 12300 additions and 2395 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -0,0 +1,18 @@
 **/node_modules
 **/.next
 **/dist
 **/coverage
 **/.vitest-cache
 **/.turbo
 .git
 .gitea
 .github
 .vscode
 .idea
 **/.env
 **/.env.local
 **/*.log
 infra/docker/data
 **/__tests__
 **/*.test.ts
 **/*.test.tsx
--- a/.env.example
+++ b/.env.example
@@ -10,6 +10,21 @@ API_BASE_URL=http://localhost:3078
 WEB_BASE_URL=http://localhost:3000
 ML_SERVING_URL=http://localhost:8000
 # MLflow (mlops profile) — http://localhost:5000/mlflow in dev, https://o.alogins.net/mlflow in prod.
 # MLFLOW_ADMIN_PASSWORD seeds the admin account on first boot (changing it after first run
 # requires the MLflow UI or API — see infra/mlflow/basic_auth.ini).
 MLFLOW_URL=http://localhost:5000
 MLFLOW_ADMIN_PASSWORD=change-me
 # Public URL shown as link in the admin sidebar (must be NEXT_PUBLIC_ to reach the browser).
 NEXT_PUBLIC_MLFLOW_URL=http://localhost:5000
 # Shared secret for internal API callbacks. Generate: openssl rand -hex 32
 INTERNAL_API_TOKEN=
 # Static token for automated/service access to the admin panel (e.g. Playwright tests).
 # Leave empty to disable token-based login. Generate: openssl rand -hex 32
 ADMIN_TOKEN=
 # AI stack — shared Agap services (ollama + litellm + langfuse). Not run from oO.
 # Prod: https://llm.alogins.net  |  Dev: http://host.docker.internal:4000 from containers,
 # http://localhost:4000 from host. Ollama: http://host.docker.internal:11434 / :11434.
--- a/.gitea/workflows/buf-check.yaml
+++ b/.gitea/workflows/buf-check.yaml
@@ -0,0 +1,37 @@
 name: buf-check
 on:
  push:
    branches: [main]
    paths:
      - 'packages/shared-types/events/**'
  pull_request:
    paths:
      - 'packages/shared-types/events/**'
 jobs:
  buf:
    name: Lint & breaking-change check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install buf
        run: |
          BUF_VERSION=1.50.0
          curl -sSfL \
            "https://github.com/bufbuild/buf/releases/download/v${BUF_VERSION}/buf-Linux-x86_64" \
            -o /usr/local/bin/buf
          chmod +x /usr/local/bin/buf
          buf --version
      - name: buf lint
        run: buf lint packages/shared-types/events
      - name: buf breaking
        if: github.event_name == 'pull_request'
        run: |
          buf breaking packages/shared-types/events \
            --against ".git#branch=${{ github.base_ref }},subdir=packages/shared-types/events"
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -42,7 +42,7 @@ packages/          shared libraries (importable across services + apps)
 ml/                Python — separate deployable from day one
  serving/         online scorer (FastAPI), called by recommender
  features/        feature definitions + store adapter
-  pipelines/       batch feature + training DAGs (Prefect/Airflow)
+  pipelines/       batch feature + training scripts
  registry/        MLflow model registry integration
  experiments/     assignment + A/B + bandit policies
  notebooks/       research only; never imported by production code
@@ -56,7 +56,7 @@ docs/              architecture notes, ADRs, API specs
 ## Contracts between modules
 - **HTTP** (OpenAPI, in `packages/shared-types/http/`) — synchronous request/response. In-process today; over the network once extracted. Signatures are identical.
- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Schema registry enforced in CI when #54 lands; until then payloads are JSON envelopes (ADR-0005).
+- **Events** (Protocol Buffers, in `packages/shared-types/events/`) — durable signals + feedback. Today: in-process `Bus` with a `onPublish` bridge to NATS JetStream when `NATS_URL` is set (ADR-0010). The in-proc bus stays the source of truth — JetStream is the durable mirror that cross-process consumers (`ml/serving`, future feature pipelines) tail. Proto schemas (ADR-0005) live in `packages/shared-types/events/oo/events/v1/`; `buf lint` + `buf breaking` run in CI on every PR touching those files (`.gitea/workflows/buf-check.yaml`).
 - Do not redefine types per module. Regenerate from `shared-types`.
 ## Conventions
@@ -65,7 +65,18 @@ docs/              architecture notes, ADRs, API specs
 - One PR = one concern. Conventional-commit prefixes (`feat:`, `fix:`, `chore:`, `docs:`, `refactor:`).
 - ADRs go in `docs/adr/NNNN-title.md` for any decision that constrains future work.
 - No secrets in repo. Local dev via `.env.local` (gitignored), prod via the server's secret store (Vaultwarden now; k8s secrets later).
- Compose profiles: `core` (api + web + admin), `full` (adds ml-serving), `mlops` (adds MLflow + Airflow), `ai` (adds Ollama + LiteLLM). Mix as needed.
+- Compose profiles: `core` (api + web + admin), `full` (adds ml-serving + nats), `mlops` (adds MLflow), `ai` (adds Ollama + LiteLLM). Mix as needed. Always pass `--profile <name>` to `build`/`up` — without a profile, no services are selected and builds silently do nothing.
 - Docker rebuild: use `--force-recreate` on `up` when only env vars changed (no image rebuild needed); new env vars in `.env.local` are not picked up by a running container until it is recreated.
 - Docker rebuild gotchas:
  - **Never run two `docker compose up --build` at once** — both grab the same `--mount=type=cache,id=pnpm` and deadlock on the API's `pnpm --prod deploy` step. Symptom: build sits silent for hours on `[api builder 8/8]`. Before starting any build, check `ps aux | grep "docker compose"` and kill any prior `up --build` (`kill -9 <pid>` — the wrapper bash and the docker compose binary are separate PIDs; kill the docker compose one).
  - **Don't add `--offline` to `pnpm --prod deploy`** — pnpm's metadata cache (`/root/.cache/pnpm/`) is not in the `/pnpm/store` cache mount, so `--offline` fails with `ERR_PNPM_NO_OFFLINE_META` for transitive devDeps (e.g. vite via vitest). Leave the deploy step network-on; it works.
   - **All TS Dockerfiles need `python3 make g++`** in the base stage, because `better-sqlite3` rebuilds natively on install. Their historical absence from `Dockerfile.admin` caused `gyp ERR! find Python` failures.
  - **`Dockerfile.ml` needs `build-essential`** (not just `gcc`) — `pyswisseph` (stars agent) compiles C from source and fails with `fatal error: math.h: No such file or directory` if only `gcc` is installed; it needs `libc-dev` too, easiest via `build-essential`.
  - **`Dockerfile.web` builder stage needs root `package.json` + `pnpm-workspace.yaml` + `pnpm-lock.yaml`** copied in. Without them, `pnpm --filter @oo/shared-types build` fails with `[ERR_PNPM_NO_PKG_MANIFEST] No package.json found in /app`. The deps stage has them but the builder is a fresh layer; selective copies must include them.
  - **A clean build of `--profile core` takes ~3 min total** when the buildx cache is warm. If it's been silent for >10 min, check for the parallel-build deadlock above before assuming "still going".
 - Run Python agent tests: `python3 -m pytest ml/agents/tests/ -x -q` (tests add repo root to `sys.path` themselves).
 - Run Python feature tests: `python3 -m pytest ml/features/ -x -q`
 - `ml/features/` files are Python mirrors of TS registries — TS is source of truth. Tests parse `registry.ts` with regex to detect drift; follow the same pattern whenever a new field is added to `ProfileFeature`.
 ## Definition of done (per feature)
@@ -78,37 +89,174 @@ docs/              architecture notes, ADRs, API specs
 ## AI stack
-oO generates tips with an LLM and ranks them with a bandit. All LLM calls route through **LiteLLM** at `llm.alogins.net` using model aliases — swapping models is a config change, not a code change.
+oO generates tips through a multi-agent pipeline (ADR-0013): pre-compute agents emit prompt snippets, an orchestrator LLM assembles them into one tip. All LLM calls route through **LiteLLM** at `llm.alogins.net` using model aliases — swapping models is a config change, not a code change.
 | Alias | Model | Used by |
 |-------|-------|---------|
 | `tip-generator` | qwen2.5:1.5b (default) | `ml/serving` tip generation |
-| `embedder` | nomic-embed-text | task clustering, dedup |
+| `embedder` | nomic-embed-text | task clustering (after LLM enrichment), dedup |
 | `judge` | claude-haiku-4-5 (cloud, eval only) | offline sim |
 Env vars: `LITELLM_URL` (prod `https://llm.alogins.net`), `OLLAMA_URL` (Agap host, `http://host.docker.internal:11434` from containers).
 Ollama and LiteLLM are **shared Agap services**, not oO services — they live in `agap_git/openai/docker-compose.yml` along with langfuse (observability). oO never starts them; ml-serving just calls the alias.
-**LLM tip generation pipeline:**
+All `httpx` calls in `ml/` must use `trust_env=False` to bypass the system proxy — same rule as `bw` and curl. Pattern: `httpx.Client(trust_env=False, timeout=N)`.
-1. `ml/features/context.py` assembles user signals → structured prompt context
+
-2. `POST /generate` in `ml/serving` calls LiteLLM → returns `TipCandidate[]`
+MLflow container-to-container calls: always pass `host_header="localhost"` to `MLflowClient` — MLflow's `--allowed-hosts` rejects `Host: mlflow` (the container DNS name) with 403. Auth credential is `MLFLOW_ADMIN_PASSWORD`. MLflow REST API lives at the origin root, not under the `/mlflow` UI prefix.
-3. Bandit policy in `ml/serving` scores + ranks candidates
+
-4. Best candidate returned as tip; reaction closes the online reward loop
+### MLflow API versions — runs vs traces
 MLflow uses **two API versions** — use the right one or you'll get 405:
 | What | API prefix | Example |
 |------|-----------|---------|
 | Runs, experiments, metrics | `/api/2.0/mlflow/` | `runs/search`, `experiments/list` |
 | Traces (LLM observability) | `/api/3.0/mlflow/traces/` | `traces/{trace_id}` |
 **Experiment IDs:** `3` = oO/serving. Artifacts stored as run tags prefixed `artifact:<path>`.
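
As an illustration of the table above, a small helper can pick the API prefix from the resource path so callers never mix the two versions. This is a hedged sketch: `mlflow_url` and `MLFLOW_ORIGIN` are hypothetical names, not part of `ml/serving`, and the only facts it relies on are the two prefixes listed above and the note that the REST API lives at the origin root.

```python
# Hypothetical default; matches the dev URL used elsewhere in this file.
MLFLOW_ORIGIN = "http://localhost:5000"


def mlflow_url(resource: str, origin: str = MLFLOW_ORIGIN) -> str:
    """Build an MLflow REST URL, choosing /api/3.0/ for traces and /api/2.0/ otherwise."""
    resource = resource.lstrip("/")
    base = origin.rstrip("/")
    if resource == "traces" or resource.startswith("traces/"):
        # Traces (LLM observability) live under the 3.0 prefix.
        return f"{base}/api/3.0/mlflow/{resource}"
    # Runs, experiments and metrics live under the 2.0 prefix.
    return f"{base}/api/2.0/mlflow/{resource}"
```

For example, `mlflow_url("runs/search")` yields the 2.0 URL and `mlflow_url("traces/tr-<trace_id>")` yields the 3.0 URL.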
 ### Querying from the host shell
 Always strip the proxy and pass `Host: localhost` (no port — `localhost:5000` fails the DNS-rebinding check).
 ```bash
 # Search recent runs (experiment 3)
 env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
  -X POST http://localhost:5000/api/2.0/mlflow/runs/search \
  -H "Content-Type: application/json" \
  -d '{"experiment_ids":["3"],"max_results":5,"order_by":["start_time DESC"]}'
 # Get a trace by ID (note: /api/3.0/, not /api/2.0/)
 env -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY -u https_proxy -u http_proxy -u all_proxy \
  curl -s -H "Host: localhost" -u "admin:${MLFLOW_ADMIN_PASSWORD}" \
  http://localhost:5000/api/3.0/mlflow/traces/tr-<trace_id> | python3 -m json.tool
 ```
 The trace response includes `trace_metadata.mlflow.traceInputs/Outputs`, `trace_metadata.mlflow.trace.sizeStats` (num_spans), and `tags.mlflow.traceName`.
 ### Getting spans (Python client from inside the container)
 The REST API has **no endpoint for spans** — `/api/3.0/mlflow/traces/{id}/spans` returns 404. Use the Python client inside `oo-ml-serving-1`:
 ```bash
 docker exec oo-ml-serving-1 python3 -c "
 import mlflow, json, os
 mlflow.set_tracking_uri('http://mlflow:5000')
 os.environ['MLFLOW_TRACKING_USERNAME'] = 'admin'
 os.environ['MLFLOW_TRACKING_PASSWORD'] = os.environ.get('MLFLOW_ADMIN_PASSWORD', '')
 client = mlflow.tracking.MlflowClient()
 trace = client.get_trace('tr-<trace_id>')
 for span in trace.data.spans:
    print(span.name, '| parent:', span.parent_id, '| status:', span.status)
    print('  inputs:', json.dumps(span.inputs)[:200])
    print('  outputs:', json.dumps(span.outputs)[:200])
    print('  attrs:', span.attributes)
 "
 ```
 ### Span structure for a tip generation trace
 A healthy `recommend` trace has 3 spans:
 | Span | Type | Parent | Key attributes |
 |------|------|--------|---------------|
 | `recommend` | CHAIN | (root) | `agent_count`, `latency_ms`; inputs include `agent_ids` list |
 | `build_context` | TOOL | recommend | `agent_count`, `task_count`, `science_destiny` |
 | `llm_orchestrator` | LLM | recommend | `prompt_tokens`, `completion_tokens`, `model`, `attempts` |
 ### Diagnosing "no agents in trace"
 If the trace shows `agent_ids: []` and `agent_count: 0` in the root span, and the orchestrator prompt says *"No pre-computed agent context available"*, it means the recommender found zero eligible snippets at request time. Causes:
 1. **Agent compute hasn't run** — no `agent_outputs` rows for this user yet
 2. **Snippets expired** — TTL elapsed since last compute
 3. **Eligibility filter dropped all agents** — none passed the manifest-driven check
 Diagnose with:
 ```bash
 docker exec oo-api-1 psql "$DATABASE_URL" -c \
  "SELECT agent_id, computed_at, expires_at FROM agent_outputs WHERE user_id='<uid>' ORDER BY computed_at DESC LIMIT 10;"
 ```
 **Multi-agent tip generation pipeline (ADR-0013):**
 1. Pre-compute agents (`ml/agents/<id>/`) run on a schedule, each emitting a snippet into `agent_outputs` with a per-agent TTL
 2. On request, `recommender` (TS) loads the eligible agent set (registry-driven, ADR-0014) and pulls the freshest non-expired snippets
 3. `POST /recommend` in `ml/serving` assembles the orchestrator prompt (`v4-orchestrator`) and calls LiteLLM via the `tip-generator` alias
 4. Returned tip is logged in `tip_scores` with the contributing agent set; reaction is logged for observability (no bandit reward loop)
 ## Current phase
-**M1 shipped. M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`.
+**M1 shipped (core + admin). M2 (AI tips) in progress.** See `README.md` for the phase roadmap and `docs/architecture/` for diagrams. Work is tracked as Gitea milestones + issues on `alvis/oO`.
-Active work: AI tip generation pipeline — issues #86–#93 in M2 milestone.
+Recent completions:
 - ADR-0013 — multi-agent recommendation: pre-computed agent snippets + orchestrator LLM (replaces ε-greedy bandit) — 2026-05-01
 - LLM context assembler + tip generation scaffold (#79, #88)
 - Model benchmarking for tip generation (#93, #95)
 - Admin UX refinements: feedback consolidation, settings placement (#100–102)
 - ADR-0012 — ε-greedy v2 (D=12) — 2026-04-26 (now superseded by ADR-0013)
 - ADR-0014 complete: unified Profile schema + backfill, manifest plumbing, `/api/profile` read-through, registry-driven eligibility filter, inference framework + per-agent inference, legacy consent column drop — 2026-05-05
 - Rich per-agent inference for all four active agents (#112, #114, #115, #116) — 2026-05-06: quiet/peak hours (time-of-day), z-score baseline (momentum), p50 lateness + project realness (overdue-task), adaptive lookback + weekly/daily cycles (recent-patterns)
 - Semantic task clustering via nomic-embed-text + LLM enrichment (#97, #113, #129) — 2026-05-12: `ml/agents/clustering.py`; titles expanded via `tip-generator` before embedding; persistent cache in `task_enrichments` table; recompute gated on task-list hash change; focus-area v3.0.0 outputs all clusters with enriched descriptions
 - Per-user feature freshness SLAs (#61) — 2026-05-06: `invalidated_by` mirrored into `ProfileFeature`; drift-detection test added
 - MLflow tracing added to `ml/serving` for all agent calls — 2026-05-06: `ml/serving/mlflow_client.py`; activated by `MLFLOW_TRACKING_URI=http://mlflow:5000` (default in compose `full` profile); requires `--profile mlops` for the MLflow container. Issue #118 (M4) tracks removal from production critical path.
 Active work (M2): *(all M2 items complete — see README for M3 planning)*
 ## ADR-0014 endpoint map (as of step 6)
 | Endpoint | Purpose |
 |----------|---------|
 | `GET /api/profile` | Read-through: user globals + prefs (by scope) + consents + contexts |
 | `PATCH /api/profile/prefs/:scope` | Upsert user_preferences rows (source='user') |
 | `PATCH /api/profile/consents` | Grant / revoke consent keys |
 | `PATCH /api/profile/contexts` | Create / activate / deactivate named contexts |
 | `GET /api/agents/registry` | Manifest list (proxy to ml/serving; 60 s cache) |
 | `POST /api/agents/:agentId/compute` | Internal: run agent compute for (user, agent) |
 | `POST /agents/{agent_id}/infer` *(ml/serving)* | Run inference framework → `{inferred_prefs}` |
 ## Inference framework (ADR-0014 §3)
 Lives in `ml/agents/inference/`. `run_inference(manifest, history)` evaluates all `InferredParam` entries in the manifest and returns `{key: value}`. Rules:
 - Below `min_history` → emit `cold_start_default`
 - `infer()` error → emit `cold_start_default` (never crashes)
 - Results written to `user_preferences` with `source='inferred'`; keys with `source='user'` are never overwritten
 Per-agent inferred params (all live in `ml/agents/<name>.py`):
 | Agent | Inferred params | Notes |
 |-------|----------------|-------|
 | `time-of-day` | `preferred_hour`, `quiet_start`, `quiet_end`, `peak_hours`, `tz` | Quiet window = longest below-baseline hour run; peak = top-quartile done hours; tz cold-start only (from auth provider) |
 | `momentum` | `engagement_trend`, `baseline_completions_per_day`, `stdev` | Baseline = 28d rolling mean done/day; snippet uses z-score language |
 | `overdue-task` | `lateness_tolerance_days`, `project_realness` | Tolerance = p50 lateness from TaskCompletion history; realness = project median vs global median |
 | `recent-patterns` | `lookback_days`, `weekly_cycle`, `daily_cycle` | Lookback sized to ≥30 done events; cycles use peak-to-mean ratio; snippet hints when strength > 0.5 |
 | `focus-area` | *(none)* | No inferred params. Clusters tasks via LLM-enriched embeddings and outputs all areas with expanded descriptions. Recomputes only when task list changes (hash-gated). |
 `UserHistory` carries both `events: list[FeedbackEvent]` and `task_completions: list[TaskCompletion]`. `AgentInferRequest` (ml/serving) accepts `task_completions: list[dict]` alongside `feedback_history`.
 `min_history` is checked against `len(history.events)` (feedback events), **not** `task_completions`. Agents that infer from completions should set `min_history=0` and guard inside `infer()`.
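
The rules above can be sketched as follows. This is a minimal illustration, not the code in `ml/agents/inference/`: the dataclass shapes, the `Manifest.inferred_params` attribute, the `days_late` field and the `infer_lateness` helper are all assumptions made for the example. Only the names `run_inference`, `InferredParam`, `UserHistory`, `min_history`, `cold_start_default` and `infer` come from the text. The `source='user'` overwrite protection happens at write time and is not modeled here.

```python
import statistics
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class UserHistory:
    events: List[dict] = field(default_factory=list)            # feedback events
    task_completions: List[dict] = field(default_factory=list)  # completion records


@dataclass
class InferredParam:
    key: str
    min_history: int
    cold_start_default: Any
    infer: Callable[[UserHistory], Any]


@dataclass
class Manifest:
    inferred_params: List[InferredParam]  # assumed attribute name


def run_inference(manifest: Manifest, history: UserHistory) -> Dict[str, Any]:
    """Evaluate every inferred param; fall back to the cold-start default."""
    out: Dict[str, Any] = {}
    for param in manifest.inferred_params:
        # min_history counts feedback events only, never task_completions.
        if len(history.events) < param.min_history:
            out[param.key] = param.cold_start_default
            continue
        try:
            out[param.key] = param.infer(history)
        except Exception:
            # An infer() error must never crash the run.
            out[param.key] = param.cold_start_default
    return out


def infer_lateness(history: UserHistory) -> float:
    """Completion-based param: declared with min_history=0, guards internally."""
    late = [c["days_late"] for c in history.task_completions]  # hypothetical field
    if len(late) < 3:
        return 1.0  # illustrative in-function cold-start value
    return float(statistics.median(late))
```

A completion-based agent would register `InferredParam("lateness_tolerance_days", 0, 1.0, infer_lateness)`, so the framework's event-count gate never blocks it and the guard inside `infer_lateness` decides whether there is enough data.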
 ## What NOT to do
 - Don't copy Todoist's data into our DB. Store the OAuth token + computed features/derivatives we need, fetch raw on demand.
 - Don't implement auth by hand. Auth.js behind an OIDC-shaped boundary (ADR-0004); swap to a dedicated OIDC provider only when mobile ships.
- Don't hardwire a recommender. The contract is `POST /recommend → {tip}`. Swap internals (bandit, LLM, hybrid), keep contract.
+- Don't hardwire a recommender. The contract is `POST /recommend → {tip}`. Swap internals (multi-agent orchestrator today, future LLM/hybrid variants), keep contract.
 - Don't hardcode the agent list. The orchestrator is registry-driven (ADR-0014); adding/removing an agent is a manifest change in `ml/agents/<id>/`, never a recommender edit.
 - Don't replace a policy in one step. New policies deploy shadow-first; promoted only after offline + online agreement with the incumbent (ADR-0002).
 - Don't over-split processes. Extract a service when pressure demands it, not in anticipation (ADR-0003).
 - Don't call LLMs directly from application code. All LLM calls go through `ml/serving` (Python) via `LITELLM_URL`. The TS recommender never holds a model name.
- Don't embed MLflow/Airflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `/airflow`, `ai.alogins.net`.
+- Don't embed MLflow/OpenWebUI in the admin panel. They are external services; link out to them. The admin shell links to `o.alogins.net/mlflow`, `ai.alogins.net`.
 - Don't `nats.publish()` directly from feature code. All publishes go through the in-process `Bus` (`services/api/src/events/bus.ts`); the NATS adapter (`events/nats.ts`) bridges every publish to JetStream when `NATS_URL` is set. This keeps subscribers, the ring-buffer tail used by the admin event viewer, and JetStream all in lockstep.
 ## Admin app
 `apps/admin` rewrites `/api/*` → `$NEXT_PUBLIC_API_URL/api/*` via `next.config.ts`. So `apiFetch('/admin/stats')` in `apps/admin/src/lib/api.ts` hits the Express backend, not a Next.js route.
 Running `tsc --noEmit -p apps/admin/tsconfig.json` always reports `Cannot find module 'next'` errors — expected outside the Next.js build context; use `next build` for real type errors.
 ## Auth / session pattern
 Sessions use an `sid` cookie. Admin routes stack `requireAuth` (sets `req.userId`) then `requireAdmin` (checks `role = 'admin'` in DB). Token-based admin auth: `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var sets the `sid` cookie — used by Playwright and CI.
--- a/README.md
+++ b/README.md
@@ -69,7 +69,7 @@ docs/        architecture, adr, api
 ## AI stack
-oO is AI-native: the recommender's job is to **rank**, not to write. An LLM generates candidate tips from the user's context; the bandit picks the best one.
+oO is AI-native. Domain-specialized agents pre-compute snippets describing the user's state from one angle each; an orchestrator LLM reasons over the assembled snippets and produces one tip (ADR-0013). The orchestrator iterates a registry, not a hardcoded list (ADR-0014) — adding an agent is a manifest change, nothing else.
 ### Three-tier layout
@@ -79,193 +79,73 @@ oO is AI-native: the recommender's job is to **rank**, not to write. An LLM gene
 | Routing | **LiteLLM** | Unified OpenAI-compatible API; model aliases; cloud fallback | `llm.alogins.net` (Agap shared) |
 | Testing | **OpenWebUI** | Prompt iteration, model comparison, manual evals | `ai.alogins.net` (Agap shared) |
-### Tip generation pipeline (Phase 2 target)
+### Tip generation pipeline (ADR-0013, M2)
 ```
-User signals  ──▶  Context assembler  ──▶  LiteLLM  ──▶  Ollama (local)
+User signals          Pre-compute agents (every 15 min)
-(tasks, calendar,    (ml/features/)         (routing)     or cloud fallback
+(tasks, calendar,  ──▶ ml/agents/{overdue-task, momentum,        ──▶  agent_outputs
- patterns, time)
+ patterns, time)        time-of-day, recent-patterns,                 (per-agent TTL)
                        focus-area, ...}
                                                                            │
                              Eligibility filter: required consents +       │
                              active context + per-user prefs (ADR-0014) ◀──┘
                                                ▼
-                                     N typed TipCandidates
+                                  Orchestrator prompt (`v4-orchestrator`)
-                                     {content, kind, model,
+                                  = global prefs + active context + snippets
                                      prompt_version, confidence}
                                                ▼
-                                    Bandit policy (ml/serving)
+                                    LiteLLM ──▶ Ollama (local) / cloud fallback
                                    scores + ranks candidates
                                                ▼
-                                         Best tip shown
+                                         Tip shown to user
                                                ▼
                              User reaction (done / snooze / dismiss + dwell)
                                                ▼
-                              Online bandit update + prompt_version tracking
+                              Logged to tip_feedback for observability
                              (no online ML reward loop — see ADR-0013)
 ```
 **Why LiteLLM as gateway:**  All LLM calls use a single `LITELLM_URL` env var. Swapping from qwen2.5 to llama3.2, or routing a fraction to Claude for A/B, is a config change in LiteLLM — zero code change in oO. The model name in `tip_scores` tells you exactly which model produced each tip.
 **Why Ollama first:**  Tips contain personal context. Local inference means no user data leaves the host for the inference path. Cloud models (Anthropic, OpenAI) are opt-in fallbacks for evaluation and simulation only, gated behind `ANTHROPIC_API_KEY`.
-### Models (planned)
+### Models (planned; routes through LiteLLM)
 | Alias | Model | Task |
 |-------|-------|------|
-| `tip-generator` | qwen2.5:7b (default) | Generate typed tip candidates from user context |
+| `tip-generator` | qwen2.5:1.5b (default) | Generate typed tip candidates from user context; local-first via Ollama |
-| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup |
+| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup; local via Ollama |
-| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B |
+| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B (requires `ANTHROPIC_API_KEY`) |
 All model calls route through **LiteLLM** at `llm.alogins.net` (or `LITELLM_URL` env var) using model aliases. This decouples tip generation from model selection — swap the backend model in LiteLLM config without code changes. See ADR-0008.
 ---
 ## Roadmap
 Issues and open work are tracked in [Gitea milestones](http://localhost:3000/alvis/oO/milestones). Pick an issue, check its milestone (= phase), read the service's `README.md`, ship.
 ### Phase 0 — Walking skeleton  *(M0)* ✓ shipped
-Goal: a single user signs in with Google, connects Todoist, and sees one random Todoist task on a black page. Deletion works.
+Single user signs in with Google, connects Todoist, sees one random task on a black page. Deletion works. Auth, integrations, recommender stub, PWA, feedback loop, ToS/privacy, metrics baseline.
 - [x] Monorepo scaffold, docker-compose dev env
 - [x] `auth` — Google OAuth2/PKCE via openid-client v6; session cookie; Next.js middleware guard
 - [x] `integrations/todoist` — OAuth2 flow, token stored in DB, disconnect supported
 - [x] `recommender` with `RandomPolicy`; stable `POST /recommend` contract; 30s task cache
 - [x] `apps/web` — sign-in, connect, tip pages; PWA manifest + icons
 - [x] Feedback: `done / snooze / dismiss`; reward inferred from dwell-time (`inferReward`); marks task complete in Todoist
 - [x] Deploy modular monolith to Agap VM via Caddy at `o.alogins.net`
 - [x] ToS + Privacy Policy pages (`/legal/terms`, `/legal/privacy`); implicit consent on sign-in
 - [x] Account deletion: revokes tokens, purges data, soft-deletes profile; button on /connect
 - [x] Metrics baseline: `tip_views` table (tip served) + `tip_feedback` (reactions) — activation + reaction rate queryable
 ### Phase 1 — Real signal + in-the-moment delivery  *(M1)* ✓ shipped
-Goal: tips are picked, not drawn from a hat — and they arrive at the right moment on the web.
+Tips are picked, not drawn from a hat. Event bus, Todoist sync, task features, ε-greedy policy (v1 + v2), web push, NATS JetStream bridge, shadow-policy registry, offline sim framework, per-user profile features, admin + ML ops console (`apps/admin`).
 - [x] Event bus scaffold: typed in-process EventEmitter with 500-event ring buffer; subjects match future NATS JetStream — swap is mechanical
 - [x] Todoist sync emits `signals.task.synced`; tip served/feedback emit `signals.tip.*`
 - [x] Features extracted per task: `is_overdue`, `task_age_days`, `priority`; context: `hour_of_day`, `day_of_week`
 - [x] `ml/serving` LinUCB (d=5) + **ε-greedy v1** (d=7, ε=0.10, day-of-week sin/cos features); per-user state persisted to disk
 - [x] `RemotePolicy` in recommender: calls ml/serving, falls back to RandomPolicy on timeout/error; logs explainability to `tip_scores`
 - [x] Feedback loop: dwell-time inferred reward (`inferReward`) → online model update; `done` in 15 s–2 min = +1.0 (magic zone)
 - [x] Offline simulation framework (`ml/experiments/sim`): rule/LLM/claude-code judges, two-policy comparison, results persisted to `sim_runs` + `sim_events`
 - [x] **ε-greedy v1 promoted to active policy** (ADR-0007) — +10.7% mean reward vs LinUCB in offline sim
 - [x] **Web Push** (VAPID): SW, subscribe/unsubscribe API, "notify me" button on tip page
 - [x] Shadow-policy registry: run N shadow policies per request, log picks without serving them (#56)
 - [ ] Quiet-hours + dedupe for push delivery
 - [ ] Delayed rewards: tasks completed directly in Todoist (requires webhook from Todoist)
 - [x] NATS JetStream bridge — durable `signals.>` and `feedback.>` streams; in-process bus stays the source of truth, every publish bridges out (#21, shipped)
-#### M1 add-on — Admin & ML Ops Console  *(fully shipped)*
+### Phase 2 — AI tips + multi-source signals  *(M2)* ✓ shipped
-
+Tips are AI-generated from user context. Multi-agent pipeline (ADR-0013): five pre-compute agents (`overdue-task`, `momentum`, `time-of-day`, `recent-patterns`, `focus-area`) emit prompt snippets; orchestrator LLM produces one tip. Unified Profile + agent registry + auto-inference framework (ADR-0014). LLM output validation + fallback. LiteLLM gateway, model benchmarking, prompt research, MLflow tracing.
 oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit).
 **Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.**  Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Airflow) runs as **separate external services** linked from the admin shell; Grafana panels are embedded.
 | Layer | Tool | Why |
 |-------|------|-----|
 | App shell | **Next.js 15** (new `apps/admin`) | Same stack as `apps/web`; reuses auth, types, SDK |
 | Dashboards / charts | **[Tremor](https://tremor.so)** | Analytics-first React + Tailwind — KPI cards, time-series, categorical, heatmaps |
 | CRUD primitives | **[shadcn/ui](https://ui.shadcn.com)** | Copy-paste Radix components; forms, dialogs, command palette |
 | Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) |
 | Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) |
 | Model registry / experiments | **[MLflow](https://mlflow.org)** *(external — `o.alogins.net/mlflow`)* | Experiment tracking, artifact browser, model registry; own basic-auth |
 | Pipeline orchestration | **[Airflow](https://airflow.apache.org)** *(external — `o.alogins.net/airflow`)* | Batch feature + retraining DAGs; own web-auth |
 | Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth |
 | Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link |
 | AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface |
 **Rejected alternatives (so we don't re-litigate):**
 - *Retool / AppSmith* — low-code speed, but admin logic leaves our repo; weak analytics affordances for an analytics product
 - *Streamlit / Gradio / Dash* — Python-first; thin RBAC and routing; splits our frontend stack in two
 - *React-admin / Refine.dev* — strong CRUD scaffolding, but analytics/ML views feel bolted on; we'd rebuild Tremor-style dashboards ourselves
 - *Superset / Metabase as the admin surface* — excellent for BI, poor for operational **writes** (revoke, replay, promote). Plan: **adopt Superset in M4** for BI alongside batch pipelines; ship a read-only SQL widget inside admin for now
 **Build sequence (plan, not code):**
 1. [x] **ADR-0006** — record the framework choice + "embed, don't rebuild" rule for MLflow/Grafana
 2. [x] **Scaffold** — `apps/admin` with Next.js 15, Tailwind, Tremor; deploy behind Caddy at `admin.o.alogins.net`
 3. [x] **RBAC** — `role` column on `users`; admin-only Next.js middleware; seed first admin via `ADMIN_SEED_EMAIL` env; `admin_actions` audit-log table
 4. [x] **Overview dashboard** — DAU/WAU KPI cards, tips served, reaction breakdown, activation funnel
 5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit actions
 6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS
 7. [x] **Feature store browser** — features sent to `ml/serving` per scoring call; diff across time for a user
 8. [x] **Model registry panel** — `/admin/models` links out to MLflow (`mlflow.o.alogins.net`); experiment tracking and dataset management in MLflow + Airflow
 9. [x] **MLOps hub** — `/admin/experiments` links to MLflow experiments/models and Airflow DAGs/datasets; bandit reset on Users page
 10. [x] **Recommendation log (explainability)** — per served tip: `(user, features, policy, score, feedback, latency)`; `tip_scores` table, 30-day retention
 11. [x] **Reward analytics** — reaction distribution over time; per-policy compare; slice by `hour_of_day`, `priority`, cohort
 12. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap
 13. [x] **Ops actions** — revoke token (Users page), replay signal, disable/promote shadow policy; every action audit-logged
 14. [x] **Read-only SQL runner** — SELECT-only runner against SQLite + saved queries (sunsets to Superset in M4)
 15. [x] **Health rollup** — `/admin/health` surfaces api, ml/serving, SQLite, event-bus; auto-refreshes every 15s
 16. [ ] **Docs** — `apps/admin/README.md`, runbook for common ops actions, ADR-0006 merged
 - [ ] Apple OAuth (deferred to M2)
 ### Phase 2 — AI tips + multi-source signals  *(M2)*
 Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multiple signal sources feed a generalized pipeline. Research-intensive milestone.
 **AI infrastructure (unblock everything else):**
 - [ ] `ai` compose profile — Ollama + LiteLLM for local dev; env vars `OLLAMA_URL` / `LITELLM_URL` (#86)
 - [ ] AI gateway — wire `ml/serving` to LiteLLM; model aliases `tip-generator` + `embedder` (#87)
 **AI tip generation pipeline:**
 - [ ] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`) (#88)
 - [ ] Tip generator endpoint — `POST /generate` in `ml/serving`; LLM → N typed `TipCandidate` objects (#79)
 - [ ] `TipCandidate` shared schema — `{content, kind, source, model, prompt_version, confidence}`; update recommender pipeline (#89)
 - [ ] LLM output validation + retry — JSON schema gate, clarification retry (2×), fallback to task-based (#90)
 - [ ] Prompt versioning — `prompt_version` + `model` columns in `tip_scores`; content-hash invalidation (#91)
 - [ ] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)
 **Evaluation & model selection:**
 - [ ] Model benchmark — compare qwen2.5:7b / llama3.2:3b / gemma3:4b via offline sim + LLM judge (#93)
 - [ ] LLM prompt research — persona design, context injection strategies, few-shot examples (#84)
 **Pipeline architecture:**
 - [ ] Signal source abstraction — `SignalSource` interface generalizing beyond Todoist (#78)
 - [ ] Generalized recommendation pipeline — candidate → rank → render stages (#80)
 - [ ] Feature registry + user profile builder — centralized features, persistent profiles (#81)
 - [ ] Tip kind system — task, advice, insight, reminder with kind-aware UI + rewards (#82)
 **Policy research:**
 - [ ] Next-gen policies — Thompson sampling, neural bandits, hybrid transfer learning (#83)
 **Integrations & infra (carried from M1):**
 - [ ] Apple OAuth (#7)
 - [x] NATS JetStream replacing in-process bus (#21) — adapter ships in `services/api/src/events/nats.ts`; in-proc bus is the producer, JetStream is the durable mirror
 - [x] Todoist sync via events (#22) — background scheduler in `services/api/src/signals/scheduler.ts` emits `signals.task.synced` every `TODOIST_SYNC_INTERVAL_MS`; on-demand fetch remains as freshness fallback
 - [ ] Event schema registry + protobuf CI gate (#54)
 - [ ] Per-user freshness SLAs for features (#61)
 - [ ] CI skeleton (#3), observability (#18), E2E tests (#20)
 **Bugs (fix before new features):**
 - [ ] TipFeedback type mismatch (#73)
 - [ ] Todoist token refresh (#74)
 - [ ] Reward fire-and-forget (#75)
 - [ ] Data retention purge (#76)
 - [ ] Port mismatch (#77)
 ### Phase 3 — Native mobile  *(M3)*
- [ ] iOS app (SwiftUI) with APNs push
+iOS (SwiftUI + APNs) and Android (Compose + FCM). `notifier` service gains APNs + FCM channels. Auth migrated from Auth.js to dedicated OIDC provider. Decide-and-deliver scheduler. See [M3 milestone](http://localhost:3000/alvis/oO/milestone/3).
 - [ ] Android app (Compose) with FCM push
 - [ ] `notifier` gains APNs + FCM channels, per-device rate limits
 - [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004)
 - [ ] Consolidate MLflow + Airflow behind shared OIDC (SSO for all internal services)
 - [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold
 ### Phase 4 — MLOps at scale  *(M4)*
- [x] Airflow + MLflow deployed as external services (`mlops` compose profile); each with own auth
+Retraining pipeline, feature-to-prompt batch jobs, prompt optimization loop, LLM fine-tuning on reaction signals, modular-monolith import-boundary lint, online experiments framework, drift monitoring. See [M4 milestone](http://localhost:3000/alvis/oO/milestone/4).
 - [ ] Write first retraining DAG (Airflow) + first MLflow experiment logging from `ml/serving`
 - [ ] Feature-to-prompt pipeline — nightly Airflow DAG materializes context for LLM; cuts inline latency (#94)
 - [ ] Prompt optimization loop — sim A/B → MLflow experiment → human-approved promotion (#95)
 - [ ] LLM fine-tuning — tip reactions as training signal; LoRA on base model; MLflow tracks runs (#96)
 - [ ] Embedding-based task clustering — `nomic-embed-text` for dedup + user pattern features (#97)
 - [ ] Consolidate MLflow + Airflow auth into shared OIDC provider (tracked as M3 issue #85)
 - [ ] Shadow → A/B → launch pipeline as first-class in MLflow
 - [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B
 - [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks
 - [ ] Drift monitoring (feature + prediction + reward drift); model cards per LLM version
 ### Phase 5 — Production hardening  *(M5)*
- [ ] Audit logging, rotation of provider tokens + internal signing keys
+Audit logging, key rotation, k3s → k8s, multi-region, public integration SDK, billing. See [M5 milestone](http://localhost:3000/alvis/oO/milestone/5).
 - [ ] **k3s** on existing VM, then k8s + HPA once multi-node justified (no cliff)
 - [ ] Multi-region failover, Postgres PITR, event-bus mirroring
 - [ ] Public integration SDK; sandbox tenancy for third-party connectors
 - [ ] Billing + subscription tiers
 ---
 ## Contributing
-This repo is split into independent modules; most tickets belong to exactly one. Pick an issue, check its milestone (= phase), read the service's `README.md`, ship.
+This repo is split into independent modules; most tickets belong to exactly one. Pick an issue from [Gitea](http://localhost:3000/alvis/oO/issues), read the service's `README.md`, ship.
 Conventions and per-service guidance live in [`CLAUDE.md`](CLAUDE.md).
--- a/apps/admin/README.md
+++ b/apps/admin/README.md
@@ -8,16 +8,33 @@ Next.js 15 app. Deployed at `admin.o.alogins.net` (dev: `http://localhost:3080`)
  and checks `role === 'admin'`. First admin is seeded via `ADMIN_SEED_EMAIL` env var at API startup.
 - Admin write actions are appended to the `admin_actions` audit log in the DB.
 ## Authentication
 Two ways to sign in:
 | Method | How |
 |--------|-----|
 | Google OAuth | Click "Sign in with Google" on the login page |
 | Token | `POST /api/auth/token` with `{ token }` matching `ADMIN_TOKEN` env var; sets `sid` cookie valid for 24 h. Used by Playwright tests and CI automation. |
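
For CI scripts, the token sign-in described in the table above could be exercised roughly as follows. This is a minimal sketch, not code from the repo: `loginWithToken`, the `baseUrl` parameter, and the injected `fetchImpl` are hypothetical names. Only the `POST /api/auth/token` route, the `{ token }` body, and the `sid` cookie come from this README.

```typescript
// Hypothetical helper: exchanges ADMIN_TOKEN for the `sid` session cookie value.
// `FetchLike` is an assumed minimal shape so a stub can stand in for `fetch`.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{
  ok: boolean;
  status: number;
  headers: { get(name: string): string | null };
}>;

async function loginWithToken(
  baseUrl: string,
  token: string,
  fetchImpl: FetchLike,
): Promise<string> {
  const res = await fetchImpl(`${baseUrl}/api/auth/token`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ token }),
  });
  if (!res.ok) {
    throw new Error(`token login failed with status ${res.status}`);
  }
  // Pull the `sid` value out of the Set-Cookie header, ignoring attributes.
  const setCookie = res.headers.get('set-cookie') ?? '';
  const sid = /(?:^|[,;]\s*)sid=([^;]+)/.exec(setCookie)?.[1];
  if (!sid) {
    throw new Error('response did not set a sid cookie');
  }
  return sid;
}
```

In Node 18+ the global `fetch` should satisfy `FetchLike`. In a browser the `Set-Cookie` header is not readable from script, so a Playwright test would more likely let the browser context store the cookie itself.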
 ## Pages
 | Route | Description |
 |-------|-------------|
 | `/` | Overview: DAU/WAU KPI cards, tips served, reaction breakdown, activation funnel |
-| `/users` | User list (paginated) |
+| `/users` | User list (paginated, searchable) |
-| `/users/:id` | User detail: identity, consents, integrations, profile features (#81 phase B), tip stats, reward history; revoke-integration + reset-bandit + rebuild-profile actions |
+| `/users/:id` | User detail: identity, consents, integrations, profile features (completion rate, dismiss rate, dwell, preferred hour, tip volume), tip stats, reward history; revoke-integration + reset-bandit + rebuild-profile actions |
-| `/audit` | Admin action audit log |
+| `/audit` | Admin action audit log with timestamps and descriptions |
-| `/events` | Event stream viewer (stub — pending API history endpoint) |
+| `/events` | Live event stream viewer with filters by subject/user/time; tail of `signals.*` from ring buffer or NATS JetStream |
-| `/reward-analytics` | Reaction distribution + per-policy / per-model / per-prompt-version / per-tip-kind breakdowns with avg reward |
+| `/features` | Feature store browser: features sent to `ml/serving` per scoring call; freshness status; per-feature SLA tracking |
 | `/tips` | Served tips explorer: tip content, score, policy, model, feedback reactions; per-user timeline |
 | `/reward-analytics` | Reaction distribution + per-policy / per-model / per-prompt-version breakdowns with avg reward; time-series and cohort slicing |
 | `/data-quality` | Missing-feature rate heatmap, stale-token rate, daily completeness, per-feature freshness SLA status |
 | `/health` | System health rollup: api, ml/serving, SQLite, event-bus, and MLflow; auto-refreshes every 15s |
 | `/sql` | Read-only SQL runner against SQLite; saved queries support; sunsets to Superset in M4 |
 | `/simulate` | Offline simulation runner: launch `ml/experiments/sim`, track runs, judge selection, policy comparison |
 | `/docs` | Admin documentation and ops runbooks inline |
 | `/ops` | Operational dashboard (deprecation candidate; pending UX refinement #107) |
 ## Dev
@@ -31,8 +48,9 @@ pnpm --filter @oo/admin dev   # starts on :3080
 Stays as a Next.js app in the monorepo permanently — it's not a candidate for extraction.
 It gets richer (more pages, embedded MLflow/Grafana) but not split.
-## Known issues
+## Known issues & pending improvements
 - `@tremor/react 3.x` declares a peer dep on React 18; the workspace uses React 19.
  Works in practice. Will resolve naturally when Tremor ships React 19 support or when
  we switch to Tremor v4 (which targets React 18+).
 - UX refinements pending (#100–102): feedback options consolidation, config page UI migration, settings UI placement
--- a/apps/admin/src/app/login/page.tsx
+++ b/apps/admin/src/app/login/page.tsx
@@ -1,15 +1,67 @@
 'use client';
 import { useState } from 'react';
 import { useRouter } from 'next/navigation';
 export default function LoginPage() {
  const router = useRouter();
  const [token, setToken] = useState('');
  const [error, setError] = useState('');
  const [loading, setLoading] = useState(false);
  async function handleTokenLogin(e: React.FormEvent) {
    e.preventDefault();
    setError('');
    setLoading(true);
    try {
      const res = await fetch('/api/auth/token', {
        method: 'POST',
        credentials: 'include',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ token }),
      });
      if (!res.ok) {
        const data = await res.json().catch(() => ({}));
        setError((data as { error?: string }).error ?? 'Invalid token');
        return;
      }
      router.push('/');
    } catch {
      setError('Request failed');
    } finally {
      setLoading(false);
    }
  }
  return (
    <div className="flex min-h-screen items-center justify-center">
-      <div className="text-center space-y-4">
+      <div className="text-center space-y-6 w-72">
        <h1 className="text-2xl font-semibold">oO Admin</h1>
-        <p className="text-gray-400 text-sm">Sign in via the main app first, then return here.</p>
+
        <a
          href="/sign-in"
          className="inline-block px-4 py-2 bg-white text-black rounded text-sm font-medium hover:bg-gray-200 transition-colors"
        >
          Sign in with Google
        </a>
        <form onSubmit={handleTokenLogin} className="space-y-3">
          <input
            type="password"
            placeholder="Admin token"
            value={token}
            onChange={(e) => setToken(e.target.value)}
            className="w-full px-3 py-2 bg-gray-900 border border-gray-700 rounded text-sm focus:outline-none focus:border-gray-500"
          />
          {error && <p className="text-red-400 text-xs">{error}</p>}
          <button
            type="submit"
            disabled={loading || !token}
            className="w-full px-4 py-2 bg-gray-700 text-white rounded text-sm font-medium hover:bg-gray-600 disabled:opacity-40 transition-colors"
          >
            {loading ? 'Signing in…' : 'Sign in with token'}
          </button>
        </form>
      </div>
    </div>
  );
--- a/apps/admin/src/app/ops/page.tsx
+++ b/apps/admin/src/app/ops/page.tsx
@@ -1,32 +1,17 @@
 'use client';
-import { useEffect, useState } from 'react';
+import { useState } from 'react';
 import { AdminShell } from '@/components/AdminShell';
-import { getPolicies, togglePolicy, replaySignal, PolicyInfo } from '@/lib/api';
+import { replaySignal } from '@/lib/api';
 const VALID_SUBJECTS = ['signals.tip.served', 'signals.tip.feedback', 'signals.task.synced'];
 export default function OpsPage() {
  const [policies, setPolicies] = useState<PolicyInfo[]>([]);
  const [replaySubject, setReplaySubject] = useState(VALID_SUBJECTS[0]);
  const [replayPayload, setReplayPayload] = useState('{\n  "userId": "",\n  "tipId": ""\n}');
  const [msg, setMsg] = useState('');
  const [error, setError] = useState('');
  useEffect(() => {
    getPolicies().then((r) => setPolicies(r.policies)).catch(() => {});
  }, []);
  const handleToggle = async (name: string, active: boolean) => {
    try {
      await togglePolicy(name, active);
      setPolicies((prev) => prev.map((p) => p.name === name ? { ...p, active } : p));
      setMsg(`Policy "${name}" ${active ? 'enabled' : 'disabled'}.`);
    } catch (e: any) {
      setError(e.message);
    }
  };
  const handleReplay = async () => {
    let payload: Record<string, unknown>;
    try {
@@ -47,32 +32,17 @@ export default function OpsPage() {
  return (
    <AdminShell>
      <div className="space-y-8">
-        <h1 className="text-xl font-semibold">Ops actions</h1>
+        <div>
          <h1 className="text-xl font-semibold">Ops</h1>
          <p className="text-sm text-gray-500 mt-1">
            Live system controls — replay past signals for backfill or debugging, and find
            per-user actions (token revoke) on the{' '}
            <a href="/users" className="text-indigo-400 hover:underline">Users page</a>.
          </p>
        </div>
        {msg && <p className="text-green-400 text-sm">{msg}</p>}
        {error && <p className="text-red-400 text-sm">{error}</p>}
        {/* Policy toggles */}
        <section className="space-y-3">
          <h2 className="text-base font-medium text-gray-300">Policies</h2>
          {policies.length === 0 ? (
            <p className="text-gray-500 text-sm">No shadow policies registered. Shadow policies can be added to the recommender source.</p>
          ) : (
            <div className="space-y-2">
              {policies.map((p) => (
                <div key={p.name} className="flex items-center justify-between bg-gray-900 border border-gray-800 rounded p-3">
                  <span className="text-sm text-gray-300 font-mono">{p.name}</span>
                  <button
                    onClick={() => handleToggle(p.name, !p.active)}
                    className={`px-3 py-1 rounded text-xs ${p.active ? 'bg-green-800 text-green-200' : 'bg-gray-800 text-gray-400'}`}
                  >
                    {p.active ? 'Active' : 'Disabled'}
                  </button>
                </div>
              ))}
            </div>
          )}
        </section>
        {/* Replay signal */}
        <section className="space-y-3">
          <h2 className="text-base font-medium text-gray-300">Replay signal</h2>
@@ -100,14 +70,6 @@ export default function OpsPage() {
          </div>
        </section>
        {/* User-level ops */}
        <section className="space-y-3">
          <h2 className="text-base font-medium text-gray-300">User-level actions</h2>
          <p className="text-sm text-gray-500">
             Revoking integration tokens and resetting bandit state are both available on the{' '}
            <a href="/users" className="text-indigo-400 hover:underline">Users page</a> — navigate to a user detail view.
          </p>
        </section>
      </div>
    </AdminShell>
  );
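
The body of `handleReplay` is elided in this hunk, so the following is only a sketch of the kind of input check a replay form might run before calling `replaySignal`. `parseReplayInput` and `ParseResult` are hypothetical names; the subject list is copied from `VALID_SUBJECTS` above.

```typescript
// Subjects copied from the ops page; everything else here is illustrative.
const VALID_SUBJECTS = [
  'signals.tip.served',
  'signals.tip.feedback',
  'signals.task.synced',
] as const;
type ReplaySubject = (typeof VALID_SUBJECTS)[number];

type ParseResult =
  | { ok: true; subject: ReplaySubject; payload: Record<string, unknown> }
  | { ok: false; error: string };

function parseReplayInput(subject: string, payloadText: string): ParseResult {
  // Reject subjects that are not on the allow-list.
  const known = VALID_SUBJECTS.find((s) => s === subject);
  if (!known) {
    return { ok: false, error: `Unknown subject: ${subject}` };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(payloadText);
  } catch {
    return { ok: false, error: 'Payload is not valid JSON' };
  }
  // Only plain JSON objects are accepted; arrays and primitives are refused.
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: 'Payload must be a JSON object' };
  }
  return { ok: true, subject: known, payload: parsed as Record<string, unknown> };
}
```

Returning a tagged result instead of throwing keeps the error message easy to pass to a state setter such as `setError`.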
--- a/apps/admin/src/app/simulate/page.tsx
+++ b/apps/admin/src/app/simulate/page.tsx
@@ -0,0 +1,111 @@
 'use client';
 import { useEffect, useState } from 'react';
 import { AdminShell } from '@/components/AdminShell';
 import { getSimulationRuns, SimRun } from '@/lib/api';
 const mlflowBase = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
 function mlflowRunUrl(runId: string) {
  return `${mlflowBase}/#/experiments/1/runs/${runId}`;
 }
 function StatusBadge({ status }: { status: string }) {
  const cls: Record<string, string> = {
    running: 'bg-blue-900 text-blue-300 border-blue-800',
    done:    'bg-green-900 text-green-300 border-green-800',
    failed:  'bg-red-900 text-red-300 border-red-800',
    pending: 'bg-gray-800 text-gray-400 border-gray-700',
  };
  return (
    <span className={`text-xs px-2 py-0.5 rounded border ${cls[status] ?? cls.pending}`}>
      {status}
    </span>
  );
 }
 function SummaryRow({ run }: { run: SimRun }) {
  const summary = run.summaryJson ? JSON.parse(run.summaryJson) as Record<string, { total_reward: number; mean_reward: number; n_pulls: number }> : null;
  return (
    <div className="bg-gray-900 border border-gray-800 rounded p-4 space-y-2">
      <div className="flex items-center justify-between">
        <div className="space-y-0.5">
          <div className="flex items-center gap-2">
            <span className="font-mono text-xs text-gray-500">{run.id}</span>
            <StatusBadge status={run.status} />
            {run.winner && <span className="text-xs text-indigo-400">winner: {run.winner}</span>}
          </div>
          <div className="text-xs text-gray-600">
            {run.nUsers}u × {run.nRounds}r × {run.tasksPerRound}t/r — {run.judgeMode} judge
            {' · '}{new Date(run.createdAt).toLocaleString()}
          </div>
        </div>
        <div className="flex items-center gap-2 flex-shrink-0">
          {run.mlflowRunId && (
            <a href={mlflowRunUrl(run.mlflowRunId)} target="_blank" rel="noreferrer"
               className="text-xs text-indigo-400 hover:underline">MLflow ↗</a>
          )}
        </div>
      </div>
      {summary && (
        <div className="grid grid-cols-2 gap-2 pt-1 lg:grid-cols-3">
          {Object.entries(summary).map(([policy, s]) => (
            <div key={policy} className={`rounded border p-2 text-xs ${policy === run.winner ? 'border-indigo-700 bg-indigo-950' : 'border-gray-800'}`}>
              <div className="font-mono font-medium text-gray-300 mb-1">{policy}</div>
              <div className="text-gray-500 space-y-0.5">
                <div>total <span className="text-gray-300">{s.total_reward.toFixed(2)}</span></div>
                <div>mean <span className="text-gray-300">{s.mean_reward.toFixed(4)}</span></div>
                <div>pulls <span className="text-gray-300">{s.n_pulls}</span></div>
              </div>
            </div>
          ))}
        </div>
      )}
    </div>
  );
 }
 export default function SimulatePage() {
  const [runs, setRuns] = useState<SimRun[]>([]);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState('');
  const refresh = () =>
    getSimulationRuns()
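       // Clear any stale error once a poll succeeds.
       .then((r) => { setRuns(r.runs); setError(''); })
       .catch((e: unknown) => setError(e instanceof Error ? e.message : 'Request failed'))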
      .finally(() => setLoading(false));
  useEffect(() => {
    refresh();
    const t = setInterval(refresh, 8_000);
    return () => clearInterval(t);
  }, []);
  return (
    <AdminShell>
      <div className="space-y-6 max-w-4xl">
        <div>
          <h1 className="text-xl font-semibold">Simulations</h1>
          <p className="text-sm text-gray-500 mt-1">
            Offline policy comparisons — trigger via the admin API or CLI. Results are logged to{' '}
            <a href={mlflowBase} target="_blank" rel="noreferrer" className="text-indigo-400 hover:underline">MLflow ↗</a>.
          </p>
        </div>
        {error && <p className="text-red-400 text-sm">{error}</p>}
        <section className="space-y-3">
          <h2 className="text-xs text-gray-500 uppercase tracking-widest font-medium">
            Run history
            {loading && <span className="text-gray-600 ml-2 normal-case">loading…</span>}
          </h2>
          {runs.length === 0 && !loading && (
            <p className="text-gray-600 text-sm">No simulation runs yet.</p>
          )}
          {runs.map((r) => <SummaryRow key={r.id} run={r} />)}
        </section>
      </div>
    </AdminShell>
  );
 }
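
`SummaryRow` calls `JSON.parse` on `summaryJson` during render, so a malformed payload would throw inside the component. A more defensive parse might look like the sketch below. `parseSummary`, `isPolicySummary`, and `PolicySummary` are hypothetical names; the three field names are taken from the type annotation in `SummaryRow`.

```typescript
interface PolicySummary {
  total_reward: number;
  mean_reward: number;
  n_pulls: number;
}

// Runtime check that a parsed value has the three numeric fields.
function isPolicySummary(v: unknown): v is PolicySummary {
  if (typeof v !== 'object' || v === null) return false;
  const o = v as Record<string, unknown>;
  return (
    typeof o['total_reward'] === 'number' &&
    typeof o['mean_reward'] === 'number' &&
    typeof o['n_pulls'] === 'number'
  );
}

// Returns null for missing or unparseable input; skips malformed entries.
function parseSummary(
  summaryJson: string | null,
): Record<string, PolicySummary> | null {
  if (!summaryJson) return null;
  let raw: unknown;
  try {
    raw = JSON.parse(summaryJson);
  } catch {
    return null;
  }
  if (typeof raw !== 'object' || raw === null || Array.isArray(raw)) return null;
  const out: Record<string, PolicySummary> = {};
  for (const [policy, stats] of Object.entries(raw as Record<string, unknown>)) {
    if (isPolicySummary(stats)) out[policy] = stats;
  }
  return out;
}
```

Dropping malformed entries rather than failing the whole summary is a design assumption; a stricter variant could return `null` as soon as one entry fails the check.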
--- a/apps/admin/src/components/AdminShell.tsx
+++ b/apps/admin/src/components/AdminShell.tsx
@@ -2,14 +2,15 @@
 import Link from 'next/link';
 import { usePathname } from 'next/navigation';
 import { useEffect, useState } from 'react';
 const mlflowUrl = process.env.NEXT_PUBLIC_MLFLOW_URL ?? '/mlflow';
 const airflowUrl = process.env.NEXT_PUBLIC_AIRFLOW_URL ?? '/airflow';
 type NavItem = {
  href: string;
  label: string;
  external?: boolean;
  svcName?: string; // key in the health services map
 };
 type NavSection = {
@@ -24,40 +25,59 @@ const NAV: NavSection[] = [
  {
    label: 'Signals',
    items: [
-      { href: '/users', label: 'Users' },
+      { href: '/users',        label: 'Users' },
-      { href: '/events', label: 'Events' },
+      { href: '/events',       label: 'Events' },
-      { href: '/features', label: 'Features' },
+      { href: '/features',     label: 'Features' },
      { href: '/data-quality', label: 'Data quality' },
    ],
  },
  {
-    label: 'Recommender status',
+    label: 'Recommender',
    items: [
-      { href: '/tips', label: 'Tips' },
+      { href: '/tips',             label: 'Tips' },
      { href: '/reward-analytics', label: 'Rewards' },
      { href: '/simulate',         label: 'Simulations' },
    ],
  },
  {
    label: 'Operations',
    items: [
      { href: '/health', label: 'Health' },
-      { href: '/ops', label: 'Ops' },
+      { href: '/ops',    label: 'Ops' },
-      { href: '/sql', label: 'SQL runner' },
+      { href: '/sql',    label: 'SQL runner' },
-      { href: '/audit', label: 'Audit log' },
+      { href: '/audit',  label: 'Audit log' },
    ],
  },
  {
    label: 'Resources',
    items: [
-      { href: '/docs', label: 'Docs' },
+      { href: '/docs',     label: 'Docs' },
-      { href: mlflowUrl, label: 'MLflow ↗', external: true },
+      { href: mlflowUrl, label: 'MLflow ↗', external: true, svcName: 'mlflow' },
      { href: airflowUrl, label: 'Airflow ↗', external: true },
    ],
  },
 ];
 const STATUS_DOT: Record<string, string> = {
  ok:       'bg-green-500',
  degraded: 'bg-yellow-400',
  down:     'bg-red-500',
 };
 export function AdminShell({ children }: { children: React.ReactNode }) {
  const pathname = usePathname();
  const [svcStatus, setSvcStatus] = useState<Record<string, string>>({});
  useEffect(() => {
    fetch('/api/admin/health', { credentials: 'include' })
      .then((r) => r.json())
      .then((data: { services?: { name: string; status: string }[] }) => {
        const map: Record<string, string> = {};
        for (const s of data.services ?? []) map[s.name] = s.status;
        setSvcStatus(map);
      })
      .catch(() => {});
  }, []);
  return (
    <div className="flex min-h-screen">
      {/* Sidebar */}
@@ -83,13 +103,19 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
                  const active =
                    !item.external &&
                    (item.href === '/' ? pathname === '/' : pathname.startsWith(item.href));
-                  const className = `flex items-center px-3 py-2 rounded text-sm transition-colors ${
+                  const className = `flex items-center gap-2 px-3 py-2 rounded text-sm transition-colors ${
                    active
                      ? 'bg-gray-800 text-white font-medium'
                      : item.external
                        ? 'text-gray-500 hover:text-white hover:bg-gray-900'
                        : 'text-gray-400 hover:text-white hover:bg-gray-900'
                  }`;
                  const dot = item.svcName
                    ? svcStatus[item.svcName]
                      ? <span className={`inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 ${STATUS_DOT[svcStatus[item.svcName]] ?? STATUS_DOT.down}`} />
                      : <span className="inline-block w-1.5 h-1.5 rounded-full flex-shrink-0 bg-gray-700" />
                    : null;
                  return item.external ? (
                    <a
                      key={item.href}
@@ -98,6 +124,7 @@ export function AdminShell({ children }: { children: React.ReactNode }) {
                      rel="noreferrer"
                      className={className}
                    >
                      {dot}
                      {item.label}
                    </a>
                  ) : (
--- a/apps/admin/src/components/UsersTable.tsx
+++ b/apps/admin/src/components/UsersTable.tsx
@@ -37,7 +37,7 @@ export function UsersTable() {
        <table className="w-full text-sm">
          <thead className="bg-gray-900 border-b border-gray-800">
            <tr>
-              {['Email', 'Name', 'Role', 'Consent', 'Joined', 'Status'].map((h) => (
+              {['ID', 'Email', 'Name', 'Role', 'Consent', 'Joined', 'Status'].map((h) => (
                <th
                  key={h}
                  className="text-left px-4 py-2.5 text-xs text-gray-500 font-medium uppercase tracking-wide"
@@ -50,13 +50,13 @@ export function UsersTable() {
          <tbody className="divide-y divide-gray-800">
            {loading ? (
              <tr>
-                <td colSpan={6} className="px-4 py-6 text-center text-gray-500">
+                <td colSpan={7} className="px-4 py-6 text-center text-gray-500">
                  Loading…
                </td>
              </tr>
            ) : users.length === 0 ? (
              <tr>
-                <td colSpan={6} className="px-4 py-6 text-center text-gray-500">
+                <td colSpan={7} className="px-4 py-6 text-center text-gray-500">
                  No users yet.
                </td>
              </tr>
@@ -66,6 +66,9 @@ export function UsersTable() {
                  key={u.id}
                  className="hover:bg-gray-900 transition-colors cursor-pointer"
                >
                  <td className="px-4 py-2.5 text-gray-500 text-xs font-mono tabular-nums">
                    {u.id.slice(0, 8)}
                  </td>
                  <td className="px-4 py-2.5">
                    <Link href={`/users/${u.id}`} className="hover:underline text-indigo-400">
                      {u.email}
--- a/apps/admin/src/lib/api.ts
+++ b/apps/admin/src/lib/api.ts
@@ -91,10 +91,6 @@ export interface HealthStatus {
  services: { name: string; status: string; latencyMs: number }[];
 }
 export interface PolicyInfo {
  name: string;
  active: boolean;
 }
 export interface SavedQuery {
  id: string;
@@ -223,16 +219,6 @@ export function getHealth() {
  return apiFetch<HealthStatus>('/admin/health');
 }
 export function getPolicies() {
  return apiFetch<{ policies: PolicyInfo[] }>('/admin/policies');
 }
 export function togglePolicy(name: string, active: boolean) {
  return apiFetch<{ ok: boolean }>(`/admin/policies/${name}/toggle`, {
    method: 'POST',
    body: JSON.stringify({ active }),
  });
 }
 export function replaySignal(subject: string, payload: Record<string, unknown>) {
  return apiFetch<{ ok: boolean }>('/admin/replay-signal', {
@@ -262,3 +248,48 @@ export function saveQuery(name: string, querySql: string) {
 export function deleteSavedQuery(id: string) {
  return apiFetch<{ ok: boolean }>(`/admin/saved-queries/${id}`, { method: 'DELETE' });
 }
 // ── Simulations ────────────────────────────────────────────────────────────
 export interface SimRun {
  id: string;
  policyA: string;
  policyB: string;
  nUsers: number;
  nRounds: number;
  tasksPerRound: number;
  judgeMode: string;
  nPolicies: number;
  status: 'pending' | 'running' | 'done' | 'failed';
  summaryJson: string | null;
  winner: string | null;
  personaBreakdownJson: string | null;
  mlflowRunId: string | null;
  createdAt: string;
  finishedAt: string | null;
 }
 export interface SimStartRequest {
  nUsers?: number;
  nRounds?: number;
  tasksPerRound?: number;
  judgeMode?: 'rule' | 'llm';
  policies?: string[];
 }
 export function startSimulation(req: SimStartRequest) {
  return apiFetch<{ id: string; status: string }>(
    '/admin/simulate/start',
    { method: 'POST', body: JSON.stringify(req) },
  );
 }
 export function getSimulationRuns() {
  return apiFetch<{ runs: SimRun[] }>('/admin/simulate/runs');
 }
 export function getSimulationRun(id: string) {
  return apiFetch<{ run: SimRun & { isRunning: boolean }; events: unknown[] }>(
    `/admin/simulate/${id}`,
  );
 }
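
The three simulation helpers suggest a start-then-poll flow. The sketch below shows one way a caller might wait for a run to finish. `waitForRun`, `RunLike`, and the `maxAttempts` cap are hypothetical; `getRun` is assumed to have the same shape as `getSimulationRun`, and the 8-second default mirrors the refresh interval on the Simulations page.

```typescript
type RunStatus = 'pending' | 'running' | 'done' | 'failed';

// Assumed subset of SimRun that the poller needs.
interface RunLike {
  id: string;
  status: RunStatus;
  winner: string | null;
}

async function waitForRun(
  id: string,
  getRun: (id: string) => Promise<{ run: RunLike }>,
  intervalMs = 8_000,
  maxAttempts = 60,
): Promise<RunLike> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { run } = await getRun(id);
    // 'done' and 'failed' are terminal; anything else means keep polling.
    if (run.status === 'done' || run.status === 'failed') return run;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`simulation ${id} did not finish after ${maxAttempts} polls`);
}
```

A caller would pass the `id` returned by `startSimulation` and treat a `failed` status as a result to inspect rather than an exception.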
--- a/apps/admin/src/lib/docs.ts
+++ b/apps/admin/src/lib/docs.ts
@@ -13,8 +13,11 @@ import { readdir, readFile } from 'fs/promises';
 import path from 'path';
 import { marked } from 'marked';
-// apps/admin sits two levels below the monorepo root.
+// In development: process.cwd() = apps/admin/, so ../../docs = monorepo root docs/.
-const DOCS_ROOT = path.resolve(process.cwd(), '../../docs');
+// In Docker standalone: CWD = /app, so ../../docs is wrong. Set DOCS_ROOT in the
 // container to the absolute path where docs/ is copied (e.g. /app/docs).
 const DOCS_ROOT =
  process.env.DOCS_ROOT ?? path.resolve(process.cwd(), '../../docs');
 export type DocCategory = 'adr' | 'architecture';
--- a/apps/admin/src/middleware.ts
+++ b/apps/admin/src/middleware.ts
@@ -4,8 +4,8 @@ import type { NextRequest } from 'next/server';
 export async function middleware(req: NextRequest) {
  const { pathname } = req.nextUrl;
-  // Pass through the login page and API calls
+  // Pass through the login page, forbidden page, and API calls
-  if (pathname.startsWith('/login') || pathname.startsWith('/api/')) {
+  if (pathname.startsWith('/login') || pathname.startsWith('/forbidden') || pathname.startsWith('/api/')) {
    return NextResponse.next();
  }
--- a/apps/admin/tsconfig.tsbuildinfo
+++ b/apps/admin/tsconfig.tsbuildinfo
--- a/apps/web/src/app/config/page.tsx
+++ b/apps/web/src/app/config/page.tsx
@@ -0,0 +1,169 @@
 'use client';
 import { useEffect, useState, useCallback } from 'react';
 import { getVapidPublicKey, subscribePush, getOrchestatorPrefs, updateOrchestratorPref } from '@/lib/api';
 type PushState = 'idle' | 'subscribed' | 'denied';
 export default function ConfigPage() {
  const [pushState, setPushState] = useState<PushState>('idle');
  const [scienceDestiny, setScienceDestiny] = useState(50);
  const [prefSaving, setPrefSaving] = useState(false);
  useEffect(() => {
    getOrchestatorPrefs().then((prefs) => {
      if (typeof prefs.science_destiny === 'number') setScienceDestiny(prefs.science_destiny);
    }).catch(() => {});
  }, []);
  const handleScienceDestinyChange = useCallback(async (value: number) => {
    setScienceDestiny(value);
    setPrefSaving(true);
    try { await updateOrchestratorPref('science_destiny', value); }
    finally { setPrefSaving(false); }
  }, []);
  useEffect(() => {
    if (typeof Notification !== 'undefined') {
      if (Notification.permission === 'granted') setPushState('subscribed');
      else if (Notification.permission === 'denied') setPushState('denied');
    }
  }, []);
  const requestPush = useCallback(async () => {
    if (!('serviceWorker' in navigator) || !('PushManager' in window)) return;
    const permission = await Notification.requestPermission();
    if (permission !== 'granted') { setPushState('denied'); return; }
    try {
      const reg = await navigator.serviceWorker.register('/sw.js');
      const vapidKey = await getVapidPublicKey();
      const sub = await reg.pushManager.subscribe({
        userVisibleOnly: true,
        applicationServerKey: vapidKey,
      });
      await subscribePush(sub.toJSON());
      setPushState('subscribed');
    } catch { setPushState('denied'); }
  }, []);
  return (
    <main style={{ minHeight: '100vh', padding: '4rem 2rem', maxWidth: '480px', margin: '0 auto' }}>
      <div style={{ display: 'flex', alignItems: 'center', gap: '1rem', marginBottom: '3rem' }}>
        <a
          href="/tip"
          style={{ color: 'rgba(255,255,255,0.35)', fontSize: '0.85rem', textDecoration: 'none' }}
        >
          ← back
        </a>
        <h2 style={{ fontSize: '1.5rem', fontWeight: 300, margin: 0, letterSpacing: '-0.02em' }}>
          Settings
        </h2>
      </div>
      {/* Notifications */}
      <section style={{ marginBottom: '2.5rem' }}>
        <h3 style={{ fontSize: '0.75rem', letterSpacing: '0.12em', textTransform: 'uppercase', color: 'rgba(255,255,255,0.35)', marginBottom: '1rem', fontWeight: 400 }}>
          Notifications
        </h3>
        <div style={{
          border: '1px solid rgba(255,255,255,0.1)',
          borderRadius: '0.75rem',
          padding: '1.25rem 1.5rem',
          display: 'flex',
          alignItems: 'center',
          justifyContent: 'space-between',
        }}>
          <div>
            <div style={{ fontWeight: 400, fontSize: '0.9rem' }}>Push notifications</div>
            <div style={{ color: 'rgba(255,255,255,0.35)', fontSize: '0.75rem', marginTop: '0.2rem' }}>
              {pushState === 'subscribed' ? 'Enabled' : pushState === 'denied' ? 'Blocked by browser' : 'Get notified when a tip is ready'}
            </div>
          </div>
          {pushState === 'idle' && (
            <button
              onClick={requestPush}
              style={{
                background: 'var(--white)',
                color: 'var(--black)',
                border: 'none',
                borderRadius: '0.375rem',
                padding: '0.375rem 0.875rem',
                fontSize: '0.8rem',
                fontWeight: 500,
                cursor: 'pointer',
              }}
            >
              Enable
            </button>
          )}
          {pushState === 'subscribed' && (
            <span style={{ color: 'rgba(255,255,255,0.35)', fontSize: '0.8rem' }}>✓</span>
          )}
        </div>
      </section>
      {/* Tip style */}
      <section style={{ marginBottom: '2.5rem' }}>
        <h3 style={{ fontSize: '0.75rem', letterSpacing: '0.12em', textTransform: 'uppercase', color: 'rgba(255,255,255,0.35)', marginBottom: '1rem', fontWeight: 400 }}>
          Tip style
        </h3>
        <div style={{
          border: '1px solid rgba(255,255,255,0.1)',
          borderRadius: '0.75rem',
          padding: '1.25rem 1.5rem',
        }}>
          <div style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'baseline', marginBottom: '0.875rem' }}>
            <span style={{ fontSize: '0.85rem', fontWeight: 500 }}>Science</span>
            <span style={{ fontSize: '0.7rem', color: 'rgba(255,255,255,0.25)' }}>
              {prefSaving ? 'saving…' : scienceDestiny === 50 ? 'balanced' : scienceDestiny < 50 ? 'data-driven' : 'intuitive'}
            </span>
            <span style={{ fontSize: '0.85rem', fontWeight: 500 }}>Destiny</span>
          </div>
          <input
            type="range"
            min={0}
            max={100}
            value={scienceDestiny}
            onChange={(e) => handleScienceDestinyChange(Number(e.target.value))}
            style={{ width: '100%', accentColor: 'var(--white)', cursor: 'pointer' }}
          />
          <div style={{ color: 'rgba(255,255,255,0.3)', fontSize: '0.7rem', marginTop: '0.75rem' }}>
            {scienceDestiny < 30
              ? 'Tips lean on patterns and data'
              : scienceDestiny > 70
              ? 'Tips lean on intuition and meaning'
              : 'Tips balance logic and intuition'}
          </div>
        </div>
      </section>
      {/* Integrations */}
      <section>
        <h3 style={{ fontSize: '0.75rem', letterSpacing: '0.12em', textTransform: 'uppercase', color: 'rgba(255,255,255,0.35)', marginBottom: '1rem', fontWeight: 400 }}>
          Integrations
        </h3>
        <a
          href="/connect"
          style={{
            display: 'flex',
            alignItems: 'center',
            justifyContent: 'space-between',
            border: '1px solid rgba(255,255,255,0.1)',
            borderRadius: '0.75rem',
            padding: '1.25rem 1.5rem',
            textDecoration: 'none',
            color: 'var(--white)',
          }}
        >
          <div>
            <div style={{ fontWeight: 400, fontSize: '0.9rem' }}>Connected apps</div>
            <div style={{ color: 'rgba(255,255,255,0.35)', fontSize: '0.75rem', marginTop: '0.2rem' }}>
              Manage Todoist and other sources
            </div>
          </div>
          <span style={{ color: 'rgba(255,255,255,0.35)', fontSize: '0.85rem' }}>→</span>
        </a>
      </section>
    </main>
  );
 }
--- a/apps/web/src/app/connect/page.tsx
+++ b/apps/web/src/app/connect/page.tsx
@@ -51,6 +51,8 @@ function ConnectPageInner() {
  }
  const todoistConnected = isConnected('todoist');
  const googleHealthConnected = isConnected('google-health');
  const anyConnected = todoistConnected || googleHealthConnected;
  return (
    <main style={{ minHeight: '100vh', padding: '4rem 2rem', maxWidth: '480px', margin: '0 auto' }}>
@@ -85,7 +87,6 @@ function ConnectPageInner() {
        marginBottom: '1rem',
      }}>
        <div style={{ display: 'flex', alignItems: 'center', gap: '0.875rem' }}>
          {/* Todoist logomark */}
          <svg width="28" height="28" viewBox="0 0 24 24" fill="none" aria-label="Todoist">
            <rect width="24" height="24" rx="6" fill="#DB4035"/>
            <path d="M6 8.5L11 13l7-7" stroke="#fff" strokeWidth="2" strokeLinecap="round" strokeLinejoin="round"/>
@@ -130,7 +131,65 @@ function ConnectPageInner() {
        )}
      </div>
-      {todoistConnected && (
+      {/* Google Health card */}
      <div style={{
        border: '1px solid rgba(255,255,255,0.1)',
        borderRadius: '0.75rem',
        padding: '1.25rem 1.5rem',
        display: 'flex',
        alignItems: 'center',
        justifyContent: 'space-between',
        marginBottom: '1rem',
      }}>
        <div style={{ display: 'flex', alignItems: 'center', gap: '0.875rem' }}>
          <svg width="28" height="28" viewBox="0 0 24 24" fill="none" aria-label="Google Health">
            <rect width="24" height="24" rx="6" fill="#EA4335"/>
            <path d="M12 6.5c0-1.1.9-2 2-2s2 .9 2 2-.9 2-2 2-2-.9-2-2z" fill="#fff"/>
            <path d="M8 10.5c0-1.1.9-2 2-2s2 .9 2 2-.9 2-2 2-2-.9-2-2z" fill="#fff" opacity=".7"/>
            <path d="M12 14.5c0 2.2-1.8 4-4 4s-4-1.8-4-4 1.8-4 4-4 4 1.8 4 4z" fill="#fff" opacity=".4"/>
            <path d="M13 13.5c.5-1 1.5-1.7 2.5-1.7 1.7 0 3 1.3 3 3s-1.3 3-3 3c-1 0-1.9-.5-2.5-1.3" stroke="#fff" strokeWidth="1.5" strokeLinecap="round" fill="none"/>
          </svg>
          <div>
            <div style={{ fontWeight: 500, fontSize: '0.9rem' }}>Google Health</div>
            <div style={{ color: 'var(--gray)', fontSize: '0.75rem', marginTop: '0.1rem' }}>
              {googleHealthConnected ? 'Connected' : 'Steps, sleep & activity'}
            </div>
          </div>
        </div>
        {googleHealthConnected ? (
          <button
            onClick={() => handleDisconnect('google-health')}
            disabled={disconnecting === 'google-health'}
            style={{
              background: 'transparent',
              border: '1px solid rgba(255,255,255,0.15)',
              color: 'var(--gray)',
              borderRadius: '0.375rem',
              padding: '0.375rem 0.875rem',
              fontSize: '0.8rem',
            }}
          >
            {disconnecting === 'google-health' ? '…' : 'Disconnect'}
          </button>
        ) : (
          <a
            href="/api/integrations/google-health/connect?redirectTo=/connect"
            style={{
              background: 'var(--white)',
              color: 'var(--black)',
              borderRadius: '0.375rem',
              padding: '0.375rem 0.875rem',
              fontSize: '0.8rem',
              fontWeight: 500,
            }}
          >
            Connect
          </a>
        )}
      </div>
      {anyConnected && (
        <div style={{ marginTop: '3rem' }}>
          <a
            href="/tip"
--- a/apps/web/src/app/tip/page.tsx
+++ b/apps/web/src/app/tip/page.tsx
@@ -1,12 +1,11 @@
 'use client';
 import { useEffect, useState, useRef, useCallback } from 'react';
-import { getRecommendation, sendFeedback, getVapidPublicKey, subscribePush } from '@/lib/api';
+import { getRecommendation, sendFeedback } from '@/lib/api';
 import type { Tip } from '@oo/shared-types';
 type State = 'loading' | 'tip' | 'empty' | 'actions' | 'done';
 // Fade wrapper — children fade in when `visible`, fade out when not
 function Fade({ visible, children, style }: {
  visible: boolean;
  children: React.ReactNode;
@@ -30,9 +29,8 @@ export default function TipPage() {
  const [visible, setVisible] = useState(false);
  const holdTimer = useRef<ReturnType<typeof setTimeout> | null>(null);
  const [pressed, setPressed] = useState(false);
-  const [pushState, setPushState] = useState<'idle' | 'subscribed' | 'denied'>('idle');
+  const [showReasoning, setShowReasoning] = useState(false);
  // Fade in after state change settles
  useEffect(() => {
    if (state === 'loading' || state === 'done') {
      setVisible(false);
@@ -42,16 +40,17 @@ export default function TipPage() {
    }
  }, [state]);
-  const loadTip = useCallback(async () => {
+  const loadTip = useCallback(async (recentTip?: string) => {
    setVisible(false);
    setState('loading');
    try {
-      const rec = await getRecommendation();
+      const rec = await getRecommendation(recentTip);
      if (!rec) {
        setState('empty');
        return;
      }
      setTip(rec.tip);
      setShowReasoning(false);
      setState('tip');
    } catch (err: any) {
      console.error('[tip] loadTip error', err?.status, err?.message);
@@ -61,42 +60,13 @@ export default function TipPage() {
  useEffect(() => { loadTip(); }, [loadTip]);
-  // Check existing push permission on mount
+  const react = async (action: 'done' | 'dismiss' | 'snooze') => {
  useEffect(() => {
    if (typeof Notification !== 'undefined' && Notification.permission === 'granted') {
      setPushState('subscribed');
    } else if (typeof Notification !== 'undefined' && Notification.permission === 'denied') {
      setPushState('denied');
    }
  }, []);
  const requestPush = useCallback(async () => {
    if (!('serviceWorker' in navigator) || !('PushManager' in window)) return;
    const permission = await Notification.requestPermission();
    if (permission !== 'granted') { setPushState('denied'); return; }
    try {
      const reg = await navigator.serviceWorker.register('/sw.js');
      const vapidKey = await getVapidPublicKey();
      const sub = await reg.pushManager.subscribe({
        userVisibleOnly: true,
        applicationServerKey: vapidKey,
      });
      await subscribePush(sub.toJSON());
      setPushState('subscribed');
    } catch { setPushState('denied'); }
  }, []);
  const react = async (action: 'done' | 'dismiss' | 'snooze' | 'helpful' | 'not_helpful') => {
    if (!tip) return;
-    const isNavigating = ['done', 'dismiss', 'snooze'].includes(action);
+    const snoozedContent = action === 'snooze' ? tip.content : undefined;
-    if (isNavigating) {
+    setVisible(false);
-      setVisible(false);
+    setState('done');
      setState('done');
    } else {
      setState('tip');
    }
    await sendFeedback(tip.id, { action });
-    if (isNavigating) setTimeout(() => loadTip(), 700);
+    setTimeout(() => loadTip(snoozedContent), 700);
  };
  const onPointerDown = () => {
@@ -119,7 +89,6 @@ export default function TipPage() {
  return (
    <>
      <style>{`
        @keyframes breathe {
          0%, 100% { opacity: 0.3; }
@@ -144,7 +113,7 @@ export default function TipPage() {
          overflow: 'hidden',
        }}
      >
-        {/* Ambient glow — breathes while loading */}
+        {/* Ambient glow */}
        <div style={{
          position: 'absolute',
          inset: 0,
@@ -192,24 +161,6 @@ export default function TipPage() {
            }}>
              hold to act
            </p>
            {pushState === 'idle' && (
              <button
                onClick={(e) => { e.stopPropagation(); requestPush(); }}
                style={{
                  marginTop: '2.5rem',
                  background: 'transparent',
                  border: 'none',
                  color: 'rgba(255,255,255,0.18)',
                  fontSize: '0.65rem',
                  letterSpacing: '0.12em',
                  textTransform: 'uppercase',
                  cursor: 'pointer',
                  padding: 0,
                }}
              >
                notify me
              </button>
            )}
          </Fade>
        )}
@@ -220,7 +171,7 @@ export default function TipPage() {
              All clear.
            </p>
            <button
-              onClick={loadTip}
+              onClick={() => loadTip()}
              style={{
                marginTop: '2rem',
                background: 'transparent',
@@ -242,12 +193,7 @@ export default function TipPage() {
          <>
            <div
              onClick={() => { setState('tip'); }}
-              style={{
+              style={{ position: 'fixed', inset: 0, background: 'rgba(0,0,0,0.5)' }}
                position: 'fixed',
                inset: 0,
                background: 'rgba(0,0,0,0.5)',
                animation: 'none',
              }}
            />
            <div style={{
              position: 'fixed',
@@ -260,8 +206,6 @@ export default function TipPage() {
              display: 'flex',
              flexDirection: 'column',
              gap: '0.75rem',
              transform: 'translateY(0)',
              transition: 'transform 0.3s ease',
            }}>
              {tip && (
                <p style={{
@@ -274,8 +218,6 @@ export default function TipPage() {
                </p>
              )}
              <ActionButton label="Done ✓" onClick={() => react('done')} primary />
              <ActionButton label="Helpful" onClick={() => react('helpful')} />
              <ActionButton label="Not helpful" onClick={() => react('not_helpful')} />
              <ActionButton label="Snooze" onClick={() => react('snooze')} />
              <ActionButton label="Dismiss" onClick={() => react('dismiss')} />
              <button
@@ -295,6 +237,102 @@ export default function TipPage() {
            </div>
          </>
        )}
        {/* Reasoning overlay */}
        {showReasoning && tip?.rationale && (
          <div
            onClick={(e) => { e.stopPropagation(); setShowReasoning(false); }}
            style={{
              position: 'fixed',
              inset: 0,
              display: 'flex',
              alignItems: 'flex-end',
              justifyContent: 'center',
              zIndex: 20,
              padding: '0 0 5rem',
            }}
          >
            <div
              onClick={(e) => e.stopPropagation()}
              style={{
                background: 'rgba(20,20,20,0.96)',
                border: '1px solid rgba(255,255,255,0.08)',
                borderRadius: '0.875rem',
                padding: '1.25rem 1.5rem',
                maxWidth: '360px',
                width: 'calc(100% - 3rem)',
              }}
            >
              <p style={{
                margin: 0,
                fontSize: '0.7rem',
                letterSpacing: '0.1em',
                textTransform: 'uppercase',
                color: 'rgba(255,255,255,0.3)',
                marginBottom: '0.625rem',
              }}>
                Why this tip
              </p>
              <p style={{
                margin: 0,
                fontSize: '0.9rem',
                fontWeight: 300,
                lineHeight: 1.5,
                color: 'rgba(255,255,255,0.75)',
              }}>
                {tip.rationale}
              </p>
            </div>
          </div>
        )}
        {/* ? button — bottom left, shows reasoning */}
        {(state === 'tip' || state === 'actions') && tip?.rationale && (
          <button
            onClick={(e) => { e.stopPropagation(); setShowReasoning((v) => !v); }}
            aria-label="Why this tip"
            style={{
              position: 'fixed',
              bottom: '1.5rem',
              left: '1.5rem',
              background: 'transparent',
              border: 'none',
              color: showReasoning ? 'rgba(255,255,255,0.5)' : 'rgba(255,255,255,0.15)',
              fontSize: '0.85rem',
              fontWeight: 400,
              lineHeight: 1,
              padding: '0.5rem',
              cursor: 'pointer',
              pointerEvents: 'auto',
              zIndex: 10,
              transition: 'color 0.2s ease',
              fontFamily: 'inherit',
            }}
          >
            ?
          </button>
        )}
        {/* Settings gear — bottom right */}
        <a
          href="/config"
          onClick={(e) => e.stopPropagation()}
          aria-label="Settings"
          style={{
            position: 'fixed',
            bottom: '1.5rem',
            right: '1.5rem',
            color: 'rgba(255,255,255,0.15)',
            fontSize: '1.1rem',
            lineHeight: 1,
            textDecoration: 'none',
            padding: '0.5rem',
            pointerEvents: 'auto',
            zIndex: 10,
          }}
        >
          ⚙
        </a>
      </main>
    </>
  );
--- a/apps/web/src/components/tests/TipPage.test.tsx
+++ b/apps/web/src/components/tests/TipPage.test.tsx
@@ -13,6 +13,8 @@ vi.mock('@/lib/api', () => ({
 import { getRecommendation, sendFeedback } from '@/lib/api';
 import TipPage from '@/app/tip/page';
 // jsdom doesn't support full anchor navigation — just verify the link exists
 const mockGetRec = getRecommendation as ReturnType<typeof vi.fn>;
 const mockSendFeedback = sendFeedback as ReturnType<typeof vi.fn>;
@@ -123,9 +125,20 @@ describe('TipPage — action sheet', () => {
    expect(mockSendFeedback).toHaveBeenCalledWith('tip:dis', { action: 'dismiss' });
  });
-  it('clicking "Helpful" calls sendFeedback with action=helpful (non-navigating)', async () => {
+  it('action sheet has exactly Done, Snooze, Dismiss — no Helpful/Not helpful', async () => {
-    await renderTipAndHold('tip:help', 'Helpful tip');
+    await renderTipAndHold('tip:actions', 'Check actions');
-    await act(async () => { fireEvent.click(screen.getByText('Helpful')); });
+    expect(screen.getByText('Done ✓')).toBeInTheDocument();
-    expect(mockSendFeedback).toHaveBeenCalledWith('tip:help', { action: 'helpful' });
+    expect(screen.getByText('Snooze')).toBeInTheDocument();
    expect(screen.getByText('Dismiss')).toBeInTheDocument();
    expect(screen.queryByText('Helpful')).not.toBeInTheDocument();
    expect(screen.queryByText('Not helpful')).not.toBeInTheDocument();
  });
  it('settings gear link is present on tip page', async () => {
    mockGetRec.mockResolvedValue({ tip: { id: 'tip:g', content: 'Gear test', source: 'todoist', createdAt: '' } });
    render(<TipPage />);
    await screen.findByText('Gear test');
    const link = screen.getByRole('link', { name: /settings/i });
    expect(link).toHaveAttribute('href', '/config');
  });
 });
--- a/apps/web/src/lib/api.ts
+++ b/apps/web/src/lib/api.ts
@@ -23,9 +23,12 @@ export async function getSession() {
  return apiFetch<{ user: { id: string; email: string; name?: string; image?: string } | null }>('/auth/session');
 }
-export async function getRecommendation(): Promise<RecommendResponse | null> {
+export async function getRecommendation(recentTip?: string): Promise<RecommendResponse | null> {
  try {
-    return await apiFetch<RecommendResponse>('/recommend', { method: 'POST' });
+    return await apiFetch<RecommendResponse>('/recommend', {
      method: 'POST',
      body: JSON.stringify(recentTip ? { recent_tip: recentTip } : {}),
    });
  } catch (e: any) {
    if (e.status === 204 || e.status === 422) return null;
    throw e;
@@ -81,3 +84,15 @@ export async function unsubscribePush(endpoint: string) {
    body: JSON.stringify({ endpoint }),
  });
 }
 export async function getOrchestatorPrefs(): Promise<Record<string, unknown>> {
  const data = await apiFetch<{ prefs: Record<string, Record<string, unknown>> }>('/profile');
  return data.prefs?.orchestrator ?? {};
 }
 export async function updateOrchestratorPref(key: string, value: unknown) {
  return apiFetch<{ ok: boolean }>('/profile/prefs/orchestrator', {
    method: 'PATCH',
    body: JSON.stringify({ [key]: value }),
  });
 }
--- a/docs/adr/0006-admin-console-framework.md
+++ b/docs/adr/0006-admin-console-framework.md
@@ -33,11 +33,10 @@ Same stack as `apps/web`. Reuses `packages/shared-types`, the Auth.js session co
 Specialized MLOps tooling runs as **separate external services** with their own auth, linked from the admin shell — not embedded or reimplemented:
 - **MLflow** → `https://o.alogins.net/mlflow` — experiment tracking, model registry, artifact browser; own basic-auth for now; see M3 for SSO consolidation
 - **Airflow** → `https://o.alogins.net/airflow` — batch pipeline orchestration, dataset management; own web-auth for now
 - **Grafana panels** → `/admin/infra` (iframed panels) — infra metrics
 - **Marimo notebooks** → launch-out link from admin
-The admin shell links to these services; clicking them opens a new tab. The `/experiments` and `/models` admin pages are hub pages with direct links to the relevant MLflow/Airflow views.
+The admin shell links to these services; clicking them opens a new tab.
 ### AuthZ
@@ -56,7 +55,7 @@ The admin shell links to these services; clicking them opens a new tab. The `/ex
 - One more Next.js app in the monorepo. Build/dev added to Turborepo.
 - Tremor + shadcn/ui are added as dependencies. shadcn components are copied into `apps/admin/src/components/ui/` — no runtime version coupling.
- MLflow (`o.alogins.net/mlflow*` → port 5000) and Airflow (`o.alogins.net/airflow*` → port 8080) are path-based routes in the existing `o.alogins.net` Caddy block, started via `docker compose --profile mlops up`.
+- MLflow (`o.alogins.net/mlflow*` → port 5000) is a path-based route in the existing `o.alogins.net` Caddy block, started via `docker compose --profile mlops up`.
- Each service manages its own auth (MLflow: built-in basic-auth; Airflow: built-in web UI auth). M3 will consolidate both behind the shared OIDC provider.
+- MLflow manages its own auth (built-in basic-auth). M3 will consolidate behind the shared OIDC provider.
- The `NEXT_PUBLIC_MLFLOW_URL` and `NEXT_PUBLIC_AIRFLOW_URL` build args in `Dockerfile.admin` default to the production URLs; override for dev builds.
+- The `NEXT_PUBLIC_MLFLOW_URL` build arg in `Dockerfile.admin` defaults to the production URL; override for dev builds.
 - `admin_actions` audit log grows unboundedly — needs a retention policy before M4.
--- a/docs/adr/0007-egreedy-v1-active-policy.md
+++ b/docs/adr/0007-egreedy-v1-active-policy.md
@@ -1,7 +1,7 @@
 # ADR-0007: ε-greedy v1 as the active recommendation policy
 ## Status
-Accepted — 2026-04-16
+Superseded by ADR-0013 — 2026-05-01
 ## Context
--- a/docs/adr/0012-egreedy-v2-profile-features.md
+++ b/docs/adr/0012-egreedy-v2-profile-features.md
@@ -1,7 +1,7 @@
 # ADR-0012 — ε-greedy v2: profile features in the bandit (D=7→12)
-**Status:** Accepted  
+**Status:** Superseded by ADR-0013 — 2026-05-01
-**Date:** 2026-04-25  
+**Date:** 2026-04-25 (accepted) / 2026-04-26 (promoted)  
 **Issue:** #99
 ## Context
@@ -106,3 +106,19 @@ projecting theta without the corresponding `A` matrix cannot be done correctly.
 the D=12 target in the issue spec and complicates the sim comparison. Deferred.
 **In-place v1 promotion without shadow** — violates ADR-0002.
 ## Promotion record (2026-04-26)
 Offline sim (`runner.py --policies egreedy-v1 egreedy-v2 --judge rule --n-users 5 --n-rounds 20 --seed 42`):
 | policy | total reward | mean reward | pulls |
 |--------|-------------|-------------|-------|
 | egreedy-v1 | −64.20 | −0.6420 | 100 |
 | egreedy-v2 | −62.90 | −0.6290 | 100 |
 **Gate passed** (v2 mean ≥ v1 mean). Per-persona: v2 wins deadline-driven, evening-relaxed, low-priority-first; v1 wins consistent-responder, overdue-ignorer.
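
The gate arithmetic in the table can be reproduced with a few lines. This is a minimal sketch, not code from `runner.py`: the `PolicyResult` shape and the `meanReward` / `gatePasses` helpers are hypothetical names introduced for illustration. Only the totals, pull counts, and the "v2 mean ≥ v1 mean" rule come from the record above.

```typescript
// Hypothetical shape for one row of the sim summary table.
interface PolicyResult {
  policy: string;
  totalReward: number;
  pulls: number;
}

// Mean reward per pull, as reported in the "mean reward" column.
function meanReward(r: PolicyResult): number {
  return r.totalReward / r.pulls;
}

// Gate as stated in the record: candidate mean must be >= baseline mean.
function gatePasses(baseline: PolicyResult, candidate: PolicyResult): boolean {
  return meanReward(candidate) >= meanReward(baseline);
}

// Figures copied from the table above.
const v1: PolicyResult = { policy: 'egreedy-v1', totalReward: -64.2, pulls: 100 };
const v2: PolicyResult = { policy: 'egreedy-v2', totalReward: -62.9, pulls: 100 };

console.log(gatePasses(v1, v2));
```

Because both policies have the same number of pulls here, comparing means is equivalent to comparing totals; the mean form is what generalizes when pull counts differ.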
 Changes applied:
 - `recommender.ts` `remotePolicy()`: `/score/egreedy` → `/score/egreedy/v2`
 - `recommender.ts` `sendRewardWithRetry()`: `/reward/egreedy` → `/reward/egreedy/v2`, added `profile_features` to payload
 - Shadow entry `egreedy-v2-shadow` left in registry (`active: false`) for rollback.
--- a/docs/adr/0013-multi-agent-recommendation.md
+++ b/docs/adr/0013-multi-agent-recommendation.md
@@ -0,0 +1,106 @@
 # ADR-0013 — Multi-agent recommendation: pre-computed agent snippets + orchestrator LLM
 **Status:** Accepted
 **Date:** 2026-05-01
 **Supersedes:** ADR-0007, ADR-0012
 ## Context
 The ε-greedy bandit (ADR-0007, promoted to v2 in ADR-0012) was the first recommendation
 policy. It served adequately during early M1 testing but carries structural problems that
 become more acute as the user base grows:
 - **Training signal sparsity.** The median user generates fewer than 5 reward signals per
  week. Ridge regression on a 12-dimensional feature vector needs far more signal than
  that to converge to a meaningful θ before the user loses interest.
 - **Cold-start cost.** Every new user starts with an uninformed identity matrix. Early tips
  are essentially random for the first weeks of use — precisely when first impressions
  matter most.
 - **Opacity.** The bandit cannot explain why it chose a tip. An orchestrator that reasons
  explicitly over named agent outputs ("3 overdue tasks + peak hour approaching") is
  interpretable by design.
 - **Coupling of generation and selection.** The current pipeline generates candidates, then
  scores them; the scoring is decoupled from the LLM reasoning. Giving the LLM the full
  pre-computed context directly is a simpler and more capable design.
 ## Decision
 Replace the RL bandit with a **multi-agent pipeline**:
 ### Sub-agents (async, pre-computed)
 Multiple domain-specialized Python agents each analyze user state from one angle and
 produce a **prompt snippet** — a short natural-language paragraph describing what they
 found. They do not produce tips. They run periodically (every 15 minutes) and store
 results in the new `agent_outputs` table with per-agent TTLs.
 Initial agent set:
 | Agent | ID | TTL |
 |---|---|---|
 | OverdueTaskAgent | `overdue-task` | 1h |
 | MomentumAgent | `momentum` | 6h |
 | TimeOfDayAgent | `time-of-day` | 15m |
 | RecentPatternsAgent | `recent-patterns` | 24h |
 | FocusAreaAgent | `focus-area` | 12h |
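
To make the TTL column concrete, here is a rough sketch of how an expiry timestamp could be derived from it. The agents themselves are Python per this ADR; TypeScript is used here only to match the rest of the repo. `AGENT_TTL_MS` and `expiresAt` are hypothetical names. The only facts taken from the ADR are the TTL values and the rule that the expiry is `computed_at + TTL` in ISO 8601.

```typescript
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

// TTLs copied from the table above, expressed in milliseconds.
const AGENT_TTL_MS = new Map<string, number>([
  ['overdue-task', 1 * HOUR_MS],
  ['momentum', 6 * HOUR_MS],
  ['time-of-day', 15 * MINUTE_MS],
  ['recent-patterns', 24 * HOUR_MS],
  ['focus-area', 12 * HOUR_MS],
]);

// Hypothetical helper: expiry = computed_at + TTL, as an ISO 8601 string.
function expiresAt(agentId: string, computedAt: Date): string {
  const ttl = AGENT_TTL_MS.get(agentId);
  if (ttl === undefined) {
    throw new Error(`unknown agent: ${agentId}`);
  }
  return new Date(computedAt.getTime() + ttl).toISOString();
}

console.log(expiresAt('time-of-day', new Date('2026-05-01T10:00:00.000Z')));
```

Keeping the TTLs in one lookup rather than scattered through each agent would make it easier to audit them against the table, though where they actually live is not specified here.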
 ### Orchestrator agent (real-time)
 When a user requests a tip, the TypeScript recommender:
 1. Fetches all non-expired `agent_outputs` rows for the user.
 2. Calls `POST /recommend` on `ml/serving` with the snippet list.
 3. `ml/serving` assembles a single orchestrator prompt (template `v4-orchestrator`)
   that concatenates all snippets, then calls LiteLLM via the existing `tip-generator`
   alias to produce one tip.
 No bandit scoring. No reward delivery to an ML model. The LLM receives full context and
 generates the tip in one call.
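
Step 3 can be pictured roughly as follows. This is a sketch under assumptions: the real `v4-orchestrator` template is not shown in this ADR, so the header text, the `[agent-id]` line format, and the `AgentSnippet` / `buildOrchestratorPrompt` names are all hypothetical. It only illustrates "concatenate all snippets into a single prompt".

```typescript
// Hypothetical shape of one snippet passed to the orchestrator.
interface AgentSnippet {
  agentId: string;
  promptText: string;
}

// Concatenate snippets into one prompt, one labelled line per agent.
// The header wording is a placeholder, not the actual template text.
function buildOrchestratorPrompt(snippets: AgentSnippet[]): string {
  const header = 'Observations from sub-agents follow. Produce exactly one tip.';
  const body = snippets
    .map((s) => `[${s.agentId}] ${s.promptText.trim()}`)
    .join('\n');
  return `${header}\n\n${body}`;
}

const prompt = buildOrchestratorPrompt([
  { agentId: 'overdue-task', promptText: '3 tasks are overdue. ' },
  { agentId: 'time-of-day', promptText: 'Peak hour is approaching.' },
]);
console.log(prompt);
```

Labelling each line with its agent ID is one way to keep the prompt traceable back to the agents that contributed to it.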
 ### Feedback
 `tipFeedback` rows are still written on every user reaction. `inferReward()` still runs
 and `rewardMilli` is logged for observability and potential future supervised learning.
 Reactions are not delivered to an ML endpoint.
 ## New data model
 ```sql
 CREATE TABLE agent_outputs (
  id TEXT PRIMARY KEY,
  user_id TEXT NOT NULL REFERENCES users(id),
  agent_id TEXT NOT NULL,          -- e.g. 'overdue-task'
  prompt_text TEXT NOT NULL,       -- snippet produced by the agent
  signals_snapshot TEXT,           -- JSON: inputs the agent consumed
  computed_at TEXT NOT NULL,       -- ISO 8601
  expires_at TEXT NOT NULL,        -- ISO 8601 = computed_at + TTL
  agent_version TEXT NOT NULL      -- bump to invalidate cached outputs on logic changes
 );
 CREATE INDEX idx_agent_outputs_user_agent_exp
  ON agent_outputs(user_id, agent_id, expires_at DESC);
 ```
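
As an illustration of how rows from this table might be consumed, the sketch below filters out expired rows and keeps the most recent row per agent. The `AgentOutputRow` type and `freshOutputs` function are hypothetical. The ADR only says the recommender fetches non-expired rows, so keeping a single latest row per agent is an assumption made here.

```typescript
// Hypothetical in-memory view of an agent_outputs row (subset of columns).
interface AgentOutputRow {
  agentId: string;
  promptText: string;
  computedAt: string; // ISO 8601
  expiresAt: string;  // ISO 8601
}

// Drop expired rows, then keep the newest row for each agent.
function freshOutputs(rows: AgentOutputRow[], now: Date): AgentOutputRow[] {
  const latest = new Map<string, AgentOutputRow>();
  for (const row of rows) {
    if (Date.parse(row.expiresAt) <= now.getTime()) continue; // expired
    const seen = latest.get(row.agentId);
    if (!seen || Date.parse(row.computedAt) > Date.parse(seen.computedAt)) {
      latest.set(row.agentId, row);
    }
  }
  return Array.from(latest.values());
}

const sampleRows: AgentOutputRow[] = [
  { agentId: 'momentum', promptText: 'old', computedAt: '2026-05-01T06:00:00.000Z', expiresAt: '2026-05-01T12:00:00.000Z' },
  { agentId: 'momentum', promptText: 'new', computedAt: '2026-05-01T09:00:00.000Z', expiresAt: '2026-05-01T15:00:00.000Z' },
  { agentId: 'time-of-day', promptText: 'stale', computedAt: '2026-05-01T09:00:00.000Z', expiresAt: '2026-05-01T09:15:00.000Z' },
];
console.log(freshOutputs(sampleRows, new Date('2026-05-01T10:00:00.000Z')));
```

In practice the same filter would more likely be pushed into the SQL query, where the `(user_id, agent_id, expires_at DESC)` index can serve it.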
 ## Consequences
 ### Positive
 - Tips are explainable: `featuresJson` in `tipScores` records which agents contributed.
 - Cold-start is eliminated: the orchestrator reasons from signals immediately, no warm-up.
 - Adding or removing an agent is a self-contained change in `ml/agents/`.
 - Swapping LLM models remains a config change (LiteLLM alias unchanged).
 ### Negative / risks
 - **No automatic exploration.** The bandit would discover that a user prefers certain tip
  types without being told. The orchestrator only knows what the agents tell it.
  Mitigation: agents can evolve to encode richer signals; offline evaluation via the
  existing bench scripts remains available.
 - **Scheduler dependency.** If the pre-compute job falls behind, agent outputs go
  stale. Mitigation: the orchestrator falls back to a raw-signal prompt when no outputs
  exist; `TimeOfDayAgent` recomputes every 15 min to stay fresh.
 - **Higher per-request token cost.** The orchestrator prompt is longer than the old bandit
  prompt. Mitigation: the `tip-generator` alias points to a small local model; token cost
  is negligible at current scale.
 ## Migration sequence
 See plan document in conversation context. 10 steps; each independently deployable and
 rollback-able. Cutover is Step 6 (single TypeScript PR). Bandit endpoints removed in
 Step 7 after 48h clean traffic.
--- a/docs/adr/0014-unified-profile-and-agent-registry.md
+++ b/docs/adr/0014-unified-profile-and-agent-registry.md
@@ -0,0 +1,230 @@
 # ADR-0014 — Unified Profile model + agent registry
 **Status:** Proposed
 **Date:** 2026-05-05
 **Issues:** #30, #111, #112, #113, #114, #115, #116
 **Supersedes (data model):** ADR-0013 (the agent set stands; this ADR replaces the implicit assumption that prefs/contexts/consents are hardcoded on `users`).
 ## Context
 ADR-0013 introduced the multi-agent pipeline: N pre-compute agents emit
 prompt snippets, an orchestrator LLM assembles them into a tip. The ADR
 specified the `agent_outputs` table and the orchestrator contract, but
 left several questions open:
 1. **Where do user preferences live?** `users.consentGiven` is a single
   boolean. There is no place for quiet hours, tone, allowed tip kinds,
   or per-integration consent. Each new preference would mean another
   typed column on `users` — and worse, every new agent needs its own
   tunable parameters (focus areas, momentum baseline, lateness tolerance)
   that are clearly per-agent state, not global user state.
 2. **How are agents discovered?** The orchestrator currently iterates a
   hardcoded list. Adding an agent means touching three places: the
   recommender, the admin UI, and the prefs schema.
 3. **How does context (work / home / vacation) interact with agents?**
   Some agents should be silenced in some contexts. There is no model.
 4. **How is per-user agent configuration learned?** Issues #112–#116
   each want to auto-infer parameters (quiet hours, focus areas, etc.)
   from history. Without a shared substrate they each reinvent storage,
   recompute cadence, and cold-start fallback.
 The current ADR-0013 design works for five agents. It will not work for
 twenty without becoming a tangle.
 ## Decision
 Three changes, designed to compose:
 ### 1. Agents are plugins with declared schemas
 Every agent ships a manifest (Python, lives next to its code in
 `ml/agents/<id>/manifest.py`):
 ```python
 from dataclasses import dataclass
 @dataclass
 class AgentManifest:
    id: str                          # 'time-of-day'
    version: str                     # bump invalidates cached outputs + inferences
    pref_schema: dict                # JSON Schema for user-tunable knobs
    context_schema: list[str]        # signals it reads, e.g. ['todoist.tasks']
    required_consents: list[str]     # ['data:todoist', 'agent:time-of-day']
    output_contract: dict            # snippet shape (free text + optional tags)
    ttl_sec: int                     # snippet freshness for agent_outputs
    inferred_params: list["InferredParam"]  # see §3 (forward reference)
 ```
 The manifest is the **single point of registration**. The orchestrator,
 admin UI, and inference framework all read from it. Adding an agent is
 adding one directory in `ml/agents/` — no edits elsewhere.
 A `GET /api/agents/registry` endpoint (TS recommender → Python proxy)
 exposes manifests so the admin app can auto-render configuration UI from
 each `pref_schema`.
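As a rough illustration of "single point of registration", the sketch below keeps manifests in an in-process dictionary and serialises them for a registry endpoint. The `REGISTRY` dict, `register()`, `registry_payload()`, and the example field values are all hypothetical; the ADR does not specify how discovery or serialisation is implemented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentManifest:
    id: str
    version: str
    pref_schema: dict
    context_schema: list[str]
    required_consents: list[str]
    output_contract: dict
    ttl_sec: int
    inferred_params: list = field(default_factory=list)


# Hypothetical in-process registry keyed by agent id.
REGISTRY: dict[str, AgentManifest] = {}


def register(manifest):
    """Add a manifest; duplicate ids are rejected so one id maps to one agent."""
    if manifest.id in REGISTRY:
        raise ValueError(f"agent id already registered: {manifest.id}")
    REGISTRY[manifest.id] = manifest
    return manifest


def registry_payload():
    """JSON body a registry endpoint could return (shape is an assumption)."""
    return json.dumps([asdict(m) for m in REGISTRY.values()])


# Example values are illustrative only.
register(AgentManifest(
    id="time-of-day",
    version="1",
    pref_schema={"type": "object",
                 "properties": {"quietStart": {"type": "string"}}},
    context_schema=[],
    required_consents=["data:core"],
    output_contract={"text": "string", "tags": "list"},
    ttl_sec=900,
))
```

An admin UI could then render one form per entry by walking `pref_schema["properties"]`.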
 ### 2. Unified Profile data model
 Three new tables replace the implicit "fields-on-users" pattern.
 `users.consentGiven` collapses into `user_consents` (one row,
 `consent_key='data:core'`); existing data migrates in a single
 backfill.
 ```sql
 -- Hybrid: typed columns where stable, KV where open-ended.
 -- Stable globals stay on users (added in this ADR):
 ALTER TABLE users ADD COLUMN tone TEXT;            -- 'direct'|'gentle'|'motivational'
 ALTER TABLE users ADD COLUMN tip_kinds_json TEXT;  -- JSON: allowed tip kinds
 -- Open-ended per-agent prefs land here:
 CREATE TABLE user_preferences (
  user_id TEXT NOT NULL REFERENCES users(id),
  scope   TEXT NOT NULL,    -- 'orchestrator' | 'agent:<id>'
  key     TEXT NOT NULL,    -- e.g. 'quietStart', 'focusAreas'
  value_json TEXT NOT NULL, -- agent validates against its pref_schema on read
  updated_at TEXT NOT NULL,
  source  TEXT NOT NULL DEFAULT 'user', -- 'user' | 'inferred'
  PRIMARY KEY (user_id, scope, key)
 );
 CREATE TABLE user_consents (
  user_id     TEXT NOT NULL REFERENCES users(id),
  consent_key TEXT NOT NULL,    -- 'data:todoist' | 'data:calendar' | 'agent:focus-area'
  granted_at  TEXT NOT NULL,
  revoked_at  TEXT,             -- null = currently active
  PRIMARY KEY (user_id, consent_key)
 );
 CREATE TABLE user_contexts (
  user_id    TEXT NOT NULL REFERENCES users(id),
  name       TEXT NOT NULL,    -- 'work' | 'home' | 'vacation' | user-named
  active     INTEGER NOT NULL DEFAULT 0, -- boolean
  schedule_json TEXT,          -- optional: when this context is active
  created_at TEXT NOT NULL,
  PRIMARY KEY (user_id, name)
 );
 ```
 Why hybrid (typed for stable globals, KV for per-agent):
 - `tone` and allowed tip kinds are referenced by every recommendation —
  putting them in JSON imposes a parse on every read.
 - Per-agent prefs are open-ended (each agent declares its own keys) and
  validated on read against the agent's `pref_schema`, so KV is correct.
 `user_preferences.source = 'user' | 'inferred'` keeps explicit user
 overrides distinguishable from inferred values (the inference framework
 never overwrites a `source='user'` row).
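One way to enforce the "never overwrite a user row" rule is to put it in the upsert itself. The sketch below assumes SQLite 3.24 or newer (for `ON CONFLICT ... DO UPDATE ... WHERE`); the helper name `write_inferred` is hypothetical and the `users` foreign key is omitted so the snippet is self-contained.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user_preferences (
  user_id TEXT NOT NULL,
  scope TEXT NOT NULL,
  key TEXT NOT NULL,
  value_json TEXT NOT NULL,
  updated_at TEXT NOT NULL,
  source TEXT NOT NULL DEFAULT 'user',
  PRIMARY KEY (user_id, scope, key)
);
"""


def write_inferred(conn, user_id, scope, key, value_json, now_iso):
    """Upsert an inferred value without touching explicit user overrides.

    The WHERE clause on the conflict branch turns the update into a no-op
    when the existing row has source='user'.
    """
    conn.execute(
        """
        INSERT INTO user_preferences
          (user_id, scope, key, value_json, updated_at, source)
        VALUES (?, ?, ?, ?, ?, 'inferred')
        ON CONFLICT (user_id, scope, key) DO UPDATE SET
          value_json = excluded.value_json,
          updated_at = excluded.updated_at
        WHERE user_preferences.source = 'inferred'
        """,
        (user_id, scope, key, value_json, now_iso),
    )
```

Keeping the guard in SQL means a second writer cannot race past an application-level check.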
 `user_contexts` ships in this ADR with **manual toggle only**.
 Auto-inference per agent type is tracked in #112–#116; cross-agent
 calendar/geo inference is out of scope.
 ### 3. Shared context-inference framework
 Each `InferredParam` in a manifest declares:
 ```python
 from dataclasses import dataclass
 from typing import Any, Callable
 @dataclass
 class InferredParam:
    key: str                # 'quietStart'
    ttl_sec: int            # how often to recompute
    cold_start_default: Any # value used until enough history exists
    min_history: int        # event count threshold
    infer: Callable[["UserHistory"], Any]  # pure function
 ```
 The framework (`ml/agents/inference/`) owns:
 - Scheduling (recomputes per-param via the existing pre-compute scheduler).
 - Reading history from `tip_views` / `tip_feedback` / `agent_outputs`.
 - Writing results to `user_preferences` with `source='inferred'`.
 - Cold-start: returns `cold_start_default` until `min_history` is met.
 - Versioning: bumping `agent.version` invalidates inferred rows for that agent.
 - Observability: structured log per recompute (window size, output diff, latency).
 Each per-agent issue (#112–#116) implements only its `infer()` functions;
 everything else is the framework.
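The cold-start rule can be sketched in a few lines. Here `UserHistory` is simplified to a list of event dicts, and `resolve()` and the `infer_quiet_start` heuristic are hypothetical stand-ins, not the framework's actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class InferredParam:
    key: str
    ttl_sec: int
    cold_start_default: Any
    min_history: int
    infer: Callable[[Sequence[dict]], Any]


def resolve(param, history):
    """Return the default until enough events exist, then the inferred value."""
    if len(history) < param.min_history:
        return param.cold_start_default
    return param.infer(history)


def infer_quiet_start(history):
    """Illustrative heuristic: one hour after the latest observed reaction."""
    return (max(event["hour"] for event in history) + 1) % 24


quiet_start = InferredParam(
    key="quietStart",
    ttl_sec=86_400,
    cold_start_default=22,
    min_history=3,
    infer=infer_quiet_start,
)
```

Because `infer` is a pure function of the history, it can be unit-tested without the scheduler or the database.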
 ## Read-through API
 The API surface stays small as the number of agents grows because every endpoint is registry-driven:
 ```
 GET   /api/profile              → { user, prefs (grouped by scope), contexts, consents, agents[] }
 PATCH /api/profile/prefs/:scope → upserts user_preferences rows (source='user')
 PATCH /api/profile/consents     → grant/revoke
 PATCH /api/profile/contexts     → activate/deactivate / create
 GET   /api/agents/registry      → manifests; admin UI auto-renders forms from pref_schema
 ```
 `GET /api/profile` is the read-through used by `ml/serving` and the web
 client; it's the single endpoint each consumer calls instead of reading
 the DB directly.
 ## Orchestrator flow under this ADR
 ```
 1. Load Profile = { user, prefs, active context, consents } via /api/profile.
 2. From agent registry, filter eligible agents:
     - required consents granted
     - not silenced by active context (declared per-agent)
     - enabled in user_preferences (default: enabled)
 3. Pull latest non-expired agent_outputs for the eligible set.
 4. Build orchestrator prompt:
     - global prefs (tone, allowed tip kinds)
     - active context name as hint
     - agent snippets in eligibility order
 5. LLM → tip.
 ```
 No hardcoded agent list anywhere in the recommender. The orchestrator
 prompt template (`v4-orchestrator`) iterates whatever it was handed.
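Step 4 of the flow above could look roughly like the following. The real `v4-orchestrator` template is not shown in this ADR, so the line layout, labels, and the function name `build_orchestrator_prompt` are assumptions made for illustration.

```python
def build_orchestrator_prompt(tone, tip_kinds, active_context, snippets):
    """Assemble a prompt from global prefs, a context hint, and agent snippets.

    `snippets` is an ordered list of (agent_id, prompt_text) pairs, already
    filtered for eligibility; the function iterates whatever it is handed.
    """
    lines = [
        f"Tone: {tone}",
        "Allowed tip kinds: " + ", ".join(tip_kinds),
    ]
    if active_context:
        lines.append(f"Active context: {active_context}")
    for agent_id, text in snippets:
        lines.append(f"[{agent_id}] {text}")
    return "\n".join(lines)
```

Nothing in the function names a specific agent, which is the property the registry-driven design depends on.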
 ## Migration plan
 One PR per step; each independently deployable.
 1. **Schema** — add the three tables; add `tone` and `tip_kinds_json` to `users`.
 2. **Backfill** — write `users.consentGiven` rows into `user_consents` as `data:core`. Keep the column for one release, then drop.
 3. **Manifest plumbing** — `ml/agents/<id>/manifest.py` for the existing five; `GET /api/agents/registry` proxy.
 4. **Read-through API** — `/api/profile` + sub-endpoints.
 5. **Orchestrator cutover** — registry-driven eligibility filter.
 6. **Inference framework** (#111) — land it; migrate `time-of-day` (#112) as the proof.
 7. **Per-agent inference** — #113–#116 land independently against the framework.
 8. **Drop `users.consentGiven`** after one release.
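The backfill in step 2 can be written as a single idempotent statement. This sketch assumes SQLite, assumes the physical column is literally named `consentGiven` and stores 0/1, and uses a hypothetical helper name; adjust to the real schema.

```python
import sqlite3


def backfill_core_consent(conn, now_iso):
    """Copy legacy users.consentGiven flags into user_consents as 'data:core'.

    INSERT OR IGNORE makes the step safe to re-run: rows that already exist
    for (user_id, 'data:core') are left untouched.
    """
    conn.execute(
        "INSERT OR IGNORE INTO user_consents (user_id, consent_key, granted_at) "
        "SELECT id, 'data:core', ? FROM users WHERE consentGiven = 1",
        (now_iso,),
    )
```

Re-running the function after a partial deploy should not change `granted_at` on rows written by the first run.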
 ## Consequences
 ### Positive
 - Adding an agent = one directory. Admin UI, prefs storage, consent
  storage, and inference all auto-pick-up.
 - Per-agent state lives next to the agent code; nothing global to edit.
 - User-controlled prefs and inferred prefs use the same storage but stay
  distinguishable (`source` column).
 - Consent revocation is row-level and time-stamped; aligns with the
  privacy stance in CLAUDE.md ("privacy is a feature, not a phase").
 - Sets up cleanly for #27 (Calendar) and #28 (Health) — they register
  their own consent keys without schema changes.
 ### Negative / risks
 - **JSON validation on read** for per-agent prefs catches bad values later
  than column typing would. Mitigated by validating in the manifest's load
  function and failing closed (use cold-start default if invalid).
 - **Three reads** for the orchestrator (registry + profile + outputs)
  add latency. A cached profile read keeps it sub-ms in practice.
 - **Migration window** during which `users.consentGiven` and
  `user_consents` both exist. Reads must consult both for one release;
  writes go to `user_consents` only.
 - **Auto-inference can mislead.** A wrong-but-confident inferred quiet
  window silences the user when they want pings. Mitigation: every
  inferred param is overrideable in admin/settings (`source='user'`
  takes precedence), and inferences only kick in past their
  `min_history` threshold.
 ## What this does NOT change
 - ADR-0013's agent set, snippet contract, or `agent_outputs` table.
 - ADR-0011's `userProfileFeatures` (ML-derived features, not user prefs).
 - ADR-0008's LiteLLM gateway pattern.
 - The orchestrator prompt template name (`v4-orchestrator`); the assembly
  rule changes, the contract does not.
--- a/docs/adr/0015-data-source-consents.md
+++ b/docs/adr/0015-data-source-consents.md
@@ -0,0 +1,44 @@
 # ADR-0015 — Data-source consents only; drop per-agent consent gate
 **Date:** 2026-05-11  
 **Status:** Accepted  
 **Supersedes:** ADR-0014 §1–§2 (consent model)
 ## Context
 ADR-0014 introduced `required_consents` on agent manifests. In practice two
 unrelated concepts were mixed into that field:
 - `data:<source>` — which data source the agent reads.
 - `agent:<id>` — whether the user opted into this specific agent.
 No UI ever granted `agent:<id>` consents, so the eligibility filter at
 `services/api/src/profile/eligibility.ts` dropped every agent for every real
 user. The symptom was confirmed by MLflow trace
 `tr-591449ea8a72af8e81b6a585234a86ab`: user `ODGp4Gkr7JWemMsqcMLMn` had five
 fresh `agent_outputs` rows but the orchestrator received `agent_ids: []`.
 ## Decision
 Collapse to a single consent dimension: **data source**.
 1. `required_consents` entries must all start with `data:`. Agent manifests no
   longer list `agent:<id>` entries.
 2. Connecting a data source via the OAuth flow automatically grants
   `data:<provider>` in `user_consents`. Disconnecting sets `revoked_at`.
 3. `data:core` continues to be auto-granted on signup.
 4. Per-agent control becomes a **preference** (`user_preferences[scope='agent:<id>', key='enabled']`), not a consent. The eligibility filter already honours this — the only change is removing the `agent:*` consent check that was always failing.
 5. Eligibility rule (final): an agent is eligible iff every `data:*` it
   declares is granted and not revoked, no active context is in
   `silenced_in_contexts`, and the `enabled` preference is not `false`.
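The final rule in item 5 can be expressed as a small predicate. The production filter lives in `services/api/src/profile/eligibility.ts`; the Python below is only a sketch of the rule, and the argument shapes (plain dicts and sets) are assumptions.

```python
def is_eligible(manifest, consents, active_contexts, prefs):
    """Apply the ADR-0015 eligibility rule to one agent.

    manifest: dict with 'id', 'required_consents', optional 'silenced_in_contexts'.
    consents: {consent_key: revoked_at or None}; a missing key means never granted.
    active_contexts: iterable of currently active context names.
    prefs: {(scope, key): value} decoded from user_preferences.
    """
    for key in manifest["required_consents"]:
        if not key.startswith("data:"):
            raise ValueError(f"non-data consent in manifest: {key}")
        if key not in consents or consents[key] is not None:
            return False  # never granted, or granted and later revoked
    if set(manifest.get("silenced_in_contexts", [])) & set(active_contexts):
        return False
    # Default is enabled: only an explicit False disables the agent.
    return prefs.get((f"agent:{manifest['id']}", "enabled")) is not False
```

Raising on an `agent:*` entry, rather than silently failing the check, would have surfaced the original bug at load time instead of as an empty `agent_ids` list.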
 ## Consequences
 - Agents that only require `data:core` (time-of-day, momentum, recent-patterns)
  become eligible immediately after signup.
 - Agents requiring `data:todoist` or `data:google-health` become eligible as
  soon as the user connects the integration — no extra consent step.
 - A backfill migration grants `data:<provider>` for every existing active
  `integration_tokens` row, unblocking users who connected before this change.
 - `ml/agents/tests/test_manifest.py` asserts all `required_consents` start
  with `data:`, preventing regression.
--- a/docs/architecture/data-model.md
+++ b/docs/architecture/data-model.md
@@ -25,12 +25,37 @@ Session              auth
  expires_at
  revoked_at?
-Profile              profile
+User (extended)      profile                                ADR-0014
-  user_id (pk)
+  + tone                       'direct' | 'gentle' | 'motivational'
-  timezone
+  + tip_kinds_json             jsonb: allowed tip kinds (stable globals)
-  quiet_hours                  jsonb: [{start,end,days}]
+
-  contexts                     jsonb: [{name,predicate}]      introduced in Phase 2
+UserPreference       profile                                ADR-0014
-  consents                     jsonb: {integration: {read,write,retain_days}}
+  user_id, scope, key (pk)
  scope                        'orchestrator' | 'agent:<id>'
  value_json                   open-ended; agent validates against its pref_schema on read
  source                       'user' | 'inferred'           (inferred never overwrites user)
  updated_at
 UserConsent          profile                                ADR-0014
  user_id, consent_key (pk)
  consent_key                  'data:todoist' | 'data:calendar' | 'agent:focus-area' | ...
  granted_at
  revoked_at?                  null = currently active
 UserContext          profile                                ADR-0014
  user_id, name (pk)           'work' | 'home' | 'vacation' | user-named
  active                       manual toggle in M2; auto-inference per agent in #112-#116
  schedule_json?               optional: when this context is active
  created_at
 AgentOutput          recommender                            ADR-0013
  id (pk)
  user_id
  agent_id                     e.g. 'overdue-task' (matches a manifest)
  prompt_text                  snippet for the orchestrator prompt
  signals_snapshot             jsonb: inputs the agent consumed
  computed_at, expires_at      computed_at + manifest.ttl_sec
  agent_version                bump to invalidate cached outputs on logic changes
 Credential           integrations
  user_id
@@ -53,10 +78,10 @@ Event                events
 TipInstance          recommender
  tip_id (ulid)
  user_id
-  policy_name                  "random" | "bandit.linucb" | "remote:v3"
+  policy_name                  "v4-orchestrator" (ADR-0013) | legacy bandit names retained for history
  policy_version
-  candidate_source             "todoist" | "advice.library" | ...
+  candidate_source             "todoist" | "advice.library" | "agent-orchestrator" | ...
-  context_snapshot             jsonb: features seen at decision time
+  context_snapshot             jsonb: features + agent snippets seen at decision time
  tip                          jsonb: {kind,title,body,source,deep_link,meta}
  created_at
  shown_at?                    set when the client reports render
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -47,8 +47,9 @@ User reactions (done / snooze / dismiss) are events too. They close the loop as
 - **OpenAPI** for HTTP; TS client auto-generated; Python pydantic hand-written while consumers are few.
 - **Feast** for feature store when we get there; homegrown adapter until then (Phase 1 seam).
 - **MLflow** for model registry and experiment tracking; deployed at `o.alogins.net/mlflow`.
 - **Airflow** for batch pipelines; deployed at `o.alogins.net/airflow`.
 - **Auth.js** embedded behind an OIDC-shaped boundary (ADR-0004). Swap to a standalone OIDC provider when mobile ships.
 - **Multi-agent recommendation** (ADR-0013) — pre-compute agents emit prompt snippets, an orchestrator LLM produces the tip. Replaced the ε-greedy bandit (ADR-0007/0012) for explainability, cold-start, and decoupling generation from selection.
 - **Registry-driven agents + unified Profile** (ADR-0014) — agents are plugins with declared manifests; per-user prefs, contexts, and per-key consents live in shared tables; auto-inferred parameters share a common framework. Adding an agent is a manifest change.
 - **k3s** as the first step beyond docker-compose — no "compose → full k8s" cliff.
 ## AI stack
@@ -60,30 +61,43 @@ All LLM inference routes through **LiteLLM** (`llm.alogins.net`) backed by **Oll
 **OpenWebUI** (`ai.alogins.net`) is the human-facing interface for prompt iteration and model testing during development.
-## Decision flow for a new tip (Phase 2 target)
+## Decision flow for a new tip (M2, ADR-0013 + ADR-0014)
 ```
                  ┌────────────────────────────────────────────────┐
                  │ Pre-compute (every 15 min, per registered agent) │
                  │  ml/agents/<id> → prompt snippet → agent_outputs │
                  │  TTL per manifest; agent_version invalidates     │
                  └────────────────────────────────────────────────┘
 client ─► gateway ─► recommender (TS)
                          │
                          ├─► profile:    GET /api/profile
                          │               (user, prefs, active context, consents)
                          │
                          ├─► registry:   GET /api/agents/registry
                          │               (manifests; eligibility filter inputs)
                          │
                          ├─► outputs:    pull freshest non-expired agent_outputs
                          │               for eligible agents (consents granted,
                          │               not silenced by active context, enabled)
                          │
                          ▼
                     ml/serving (Python)
                          │
-                          ├─► context:    ml/features/context.py
+                          ├─► assemble:   v4-orchestrator prompt
-                          │               (tasks + reactions + time patterns → prompt)
+                          │               = global prefs + active context + snippets
                          │
-                          ├─► generate:   LiteLLM → Ollama
+                          ├─► generate:   LiteLLM → Ollama → one tip
                          │               → N TipCandidates {content, kind, model, prompt_version}
                          │
-                          ├─► score:      bandit policy scores each candidate
+                          └─► persist:    tip_scores {tip, contributing agents,
-                          │
+                                          prompt_version, llm_model, latency}
-                          ├─► shadows:    shadow policies log picks without serving
+                          ◄─  tip
                          │
                          └─► persist:    tip_scores {candidate, policy, features, latency}
                          ◄─  best TipCandidate
 ```
-**Phase 1 (shipped M1):** candidates come from Todoist task list, no LLM. The bandit scores tasks directly.
+**Evolution:**
 - **Phase 1 (M1):** candidates from Todoist; ε-greedy bandit scored tasks directly (ADR-0007, ADR-0012). Superseded.
 - **Phase 2 early (M2):** LLM-generated candidates ranked by bandit. Superseded mid-milestone.
 - **Phase 2 current (M2):** multi-agent pipeline (ADR-0013), registry-driven and registry-extensible (ADR-0014). No bandit; the orchestrator LLM reasons over named agent snippets.
-**Phase 2 (shipped M2):** LLM candidates are generated in parallel with Todoist fetch. Both pools are merged, scored by the bandit, and the winner served. `tip_scores` tracks `prompt_version`, `llm_model`, and `tip_kind` for every row.
+Feedback: `POST /feedback → events.emit(reaction)`. No online ML reward loop (ADR-0013 §Consequences); reactions are logged in `tip_feedback` for observability and potential future supervised learning.
 Feedback: `POST /feedback → events.emit(reaction)` → online bandit update + `prompt_version` tracked for A/B analysis.
--- a/docs/architecture/privacy.md
+++ b/docs/architecture/privacy.md
@@ -26,7 +26,7 @@ User taps "Delete account" in settings → hard confirm → `User.deleted_at` se
 ## Scope boundaries
-Each integration declares the scopes it requests and the features it derives. The `Profile.consents` column is the source of truth; a scope removed from consent short-circuits derived-feature computation at the feature store.
+Each integration and each agent declares the consent keys it requires (`data:todoist`, `agent:focus-area`, ...) in its manifest. The `user_consents` table is the source of truth (per-key rows, revocation is a `revoked_at` write — never a delete, so audits stay clean). A revoked consent short-circuits derived-feature computation at the feature store and removes the dependent agent from the orchestrator's eligible set on the next tip. See ADR-0014.
 ## Audit
--- a/infra/docker/Dockerfile.admin
+++ b/infra/docker/Dockerfile.admin
@@ -1,32 +1,33 @@
-FROM node:22-alpine AS base
+# syntax=docker/dockerfile:1.7
 RUN npm install -g pnpm
-FROM base AS deps
+FROM node:22-slim AS base
-WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends \
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
+      python3 make g++ ca-certificates \
-COPY packages/shared-types/package.json ./packages/shared-types/
+ && rm -rf /var/lib/apt/lists/* \
-COPY apps/admin/package.json ./apps/admin/
+ && npm install -g pnpm
-RUN pnpm install --frozen-lockfile
+ENV CI=true \
    PNPM_HOME=/pnpm \
    PATH=/pnpm:$PATH
 RUN pnpm config set store-dir /pnpm/store
 FROM base AS builder
 WORKDIR /app
-COPY --from=deps /app/node_modules ./node_modules
+COPY pnpm-lock.yaml ./
-COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
-COPY --from=deps /app/apps/admin/node_modules ./apps/admin/node_modules
+COPY . .
-COPY tsconfig.base.json ./
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
-COPY packages/shared-types ./packages/shared-types
+    pnpm install --frozen-lockfile --offline \
-COPY apps/admin ./apps/admin
+      --filter @oo/admin... --filter @oo/shared-types
 RUN pnpm --filter @oo/shared-types build
 ARG NEXT_PUBLIC_MLFLOW_URL=/mlflow
 ARG NEXT_PUBLIC_AIRFLOW_URL=/airflow
 ENV NEXT_TELEMETRY_DISABLED=1 \
-    NEXT_PUBLIC_MLFLOW_URL=$NEXT_PUBLIC_MLFLOW_URL \
+    NEXT_PUBLIC_MLFLOW_URL=$NEXT_PUBLIC_MLFLOW_URL
    NEXT_PUBLIC_AIRFLOW_URL=$NEXT_PUBLIC_AIRFLOW_URL
 RUN pnpm --filter @oo/admin build
-FROM node:22-alpine AS runner
+FROM node:22-slim AS runner
-ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080
+ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3080 DOCS_ROOT=/app/docs
 WORKDIR /app
 COPY --from=builder /app/apps/admin/.next/standalone ./
 COPY --from=builder /app/apps/admin/.next/static ./apps/admin/.next/static
 COPY --from=builder /app/docs ./docs
 CMD ["node", "apps/admin/server.js"]
--- a/infra/docker/Dockerfile.api
+++ b/infra/docker/Dockerfile.api
@@ -1,32 +1,35 @@
-FROM node:22-alpine AS base
+# syntax=docker/dockerfile:1.7
 RUN npm install -g pnpm
-FROM base AS deps
+FROM node:22-slim AS base
-WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends \
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
+      python3 make g++ ca-certificates \
-COPY packages/shared-types/package.json ./packages/shared-types/
+ && rm -rf /var/lib/apt/lists/* \
-COPY services/api/package.json ./services/api/
+ && npm install -g pnpm
-RUN pnpm install --frozen-lockfile
+ENV CI=true \
    PNPM_HOME=/pnpm \
    PATH=/pnpm:$PATH
 RUN pnpm config set store-dir /pnpm/store
 FROM base AS builder
 WORKDIR /app
-COPY --from=deps /app/node_modules ./node_modules
+COPY pnpm-lock.yaml ./
-COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm fetch
-COPY --from=deps /app/services/api/node_modules ./services/api/node_modules
+COPY . .
-COPY tsconfig.base.json ./
+RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
-COPY packages/shared-types ./packages/shared-types
+    pnpm install --frozen-lockfile \
-COPY services/api ./services/api
+      --filter @oo/api... --filter @oo/shared-types
 RUN pnpm --filter @oo/shared-types build
 RUN pnpm --filter @oo/api build
 RUN --mount=type=cache,id=pnpm,target=/pnpm/store \
    pnpm --filter @oo/api --prod deploy --legacy /deploy \
 && cp -r services/api/dist /deploy/dist \
 && rm -rf /deploy/node_modules/@oo/shared-types/src \
 && cp -r packages/shared-types/dist /deploy/node_modules/@oo/shared-types/dist
-FROM node:22-alpine AS runner
+FROM node:22-slim AS runner
 WORKDIR /app
-RUN npm install -g pnpm
+ENV NODE_ENV=production
-COPY package.json pnpm-workspace.yaml pnpm-lock.yaml* ./
+COPY --from=builder /deploy/package.json ./
-COPY packages/shared-types/package.json ./packages/shared-types/
+COPY --from=builder /deploy/node_modules ./node_modules
-COPY services/api/package.json ./services/api/
+COPY --from=builder /deploy/dist ./dist
 RUN pnpm install --prod --frozen-lockfile
 COPY --from=builder /app/packages/shared-types/dist ./packages/shared-types/dist
 COPY --from=builder /app/services/api/dist ./services/api/dist
 WORKDIR /app/services/api
 CMD ["node", "dist/index.js"]
--- a/infra/docker/Dockerfile.ml
+++ b/infra/docker/Dockerfile.ml
@@ -1,6 +1,11 @@
 FROM python:3.12-slim
-WORKDIR /app
+WORKDIR /app/ml/serving
 RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential \
 && rm -rf /var/lib/apt/lists/*
 COPY ml/serving/requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
-COPY ml/serving/main.py .
+COPY ml/ /app/ml/
 # PYTHONPATH=/app lets 'import ml.agents.*' resolve from /app/ml/agents/
 ENV PYTHONPATH=/app
 CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
--- a/infra/docker/Dockerfile.web
+++ b/infra/docker/Dockerfile.web
@@ -13,6 +13,7 @@ WORKDIR /app
 COPY --from=deps /app/node_modules ./node_modules
 COPY --from=deps /app/packages/shared-types/node_modules ./packages/shared-types/node_modules
 COPY --from=deps /app/apps/web/node_modules ./apps/web/node_modules
 COPY package.json pnpm-workspace.yaml pnpm-lock.yaml ./
 COPY tsconfig.base.json ./
 COPY packages/shared-types ./packages/shared-types
 COPY apps/web ./apps/web
--- a/infra/docker/docker-compose.yml
+++ b/infra/docker/docker-compose.yml
@@ -11,12 +11,15 @@ services:
    env_file: ../../.env.local
    environment:
      NODE_ENV: production
      ML_SERVING_URL: "http://ml-serving:8000"
      MLFLOW_URL: "http://mlflow:5000"
      INTERNAL_API_TOKEN: "${INTERNAL_API_TOKEN:-}"
    volumes:
      - /mnt/ssd/dbs/oo:/mnt/ssd/dbs/oo
    ports:
      - "127.0.0.1:3078:3078"
    healthcheck:
-      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3078/health"]
+      test: ["CMD", "node", "-e", "fetch('http://localhost:3078/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"]
      interval: 10s
      timeout: 5s
      retries: 5
@@ -49,6 +52,7 @@ services:
      PORT: "3080"
      HOSTNAME: "0.0.0.0"
      NEXT_PUBLIC_API_URL: ""
      NEXT_PUBLIC_MLFLOW_URL: "/mlflow"
      INTERNAL_API_URL: "http://api:3078"
    ports:
      - "127.0.0.1:3080:3080"
@@ -67,6 +71,7 @@ services:
    environment:
      LITELLM_URL: ${LITELLM_URL:-http://host.docker.internal:4000}
      OLLAMA_URL: ${OLLAMA_URL:-http://host.docker.internal:11434}
      MLFLOW_TRACKING_URI: ${MLFLOW_TRACKING_URI:-http://mlflow:5000}
    extra_hosts:
      - "host.docker.internal:host-gateway"
    ports:
@@ -77,89 +82,49 @@ services:
      timeout: 5s
      retries: 5
-  # ── mlops profile — MLflow + Airflow ──────────────────────────────────────
+  # ── ai profile — Ollama + LiteLLM for local dev ──────────────────────────
-  # Start: docker compose --profile mlops up
+  # Start: docker compose --profile ai up
-  # MLflow UI:  http://localhost:5000       or https://o.alogins.net/mlflow  (admin / password — change via basic_auth.ini)
+  # Use when the Agap shared Ollama/LiteLLM services are not available locally.
-  # Airflow UI: http://localhost:8080/airflow  or https://o.alogins.net/airflow  (admin / AIRFLOW_ADMIN_PASSWORD)
+  # Set LITELLM_URL=http://localhost:4000 and OLLAMA_URL=http://localhost:11434
-  # Caddy routes /mlflow* and /airflow* inside the o.alogins.net block
+  # in .env.local to point ml-serving at these containers instead of Agap.
-  airflow-db:
+  ollama:
-    image: postgres:16-alpine
+    image: ollama/ollama:latest
-    profiles: [mlops]
+    profiles: [ai]
    environment:
      POSTGRES_DB: airflow
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: ${AIRFLOW_DB_PASSWORD:-airflow}
    volumes:
-      - /mnt/ssd/dbs/oo/airflow-db:/var/lib/postgresql/data
+      - ollama-models:/root/.ollama
    ports:
      - "127.0.0.1:11434:11434"
    healthcheck:
-      test: ["CMD-SHELL", "pg_isready -U airflow"]
+      test: ["CMD", "curl", "-sf", "http://localhost:11434/api/tags"]
      interval: 15s
      timeout: 5s
      retries: 10
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    profiles: [ai]
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY:-sk-local-dev}
    command: >
      --model ollama/qwen2.5:1.5b
      --model ollama/nomic-embed-text
      --api_base http://ollama:11434
      --port 4000
    ports:
      - "127.0.0.1:4000:4000"
    depends_on:
      ollama:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:4000/health"]
      interval: 10s
      timeout: 5s
      retries: 5
-  airflow-init:
+  # ── mlops profile — MLflow ────────────────────────────────────────────────
-    image: apache/airflow:2.9.3
+  # Start: docker compose --profile mlops up
-    profiles: [mlops]
+  # MLflow UI:  http://localhost:5000  or  https://o.alogins.net/mlflow
    entrypoint: /bin/bash
    command:
      - -c
      - |
        airflow db migrate
        airflow users create \
          --username admin \
          --firstname Admin \
          --lastname User \
          --role Admin \
          --email admin@oo.local \
          --password "$${AIRFLOW_ADMIN_PASSWORD:-admin}"
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
      AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
    depends_on:
      airflow-db:
        condition: service_healthy
    restart: "no"
  airflow-webserver:
    image: apache/airflow:2.9.3
    profiles: [mlops]
    command: webserver
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW_SECRET_KEY:-change-me-in-prod}
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
      AIRFLOW__WEBSERVER__BASE_URL: ${AIRFLOW_BASE_URL:-https://o.alogins.net/airflow}
    volumes:
      - ../../ml/pipelines:/opt/airflow/dags:ro
    ports:
      - "127.0.0.1:8080:8080"
    depends_on:
      airflow-init:
        condition: service_completed_successfully
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s
  airflow-scheduler:
    image: apache/airflow:2.9.3
    profiles: [mlops]
    command: scheduler
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:${AIRFLOW_DB_PASSWORD:-airflow}@airflow-db/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW_FERNET_KEY:-}
    volumes:
      - ../../ml/pipelines:/opt/airflow/dags:ro
    depends_on:
      airflow-init:
        condition: service_completed_successfully
  # ── events profile — NATS JetStream ─────────────────────────────────────
  # Start: docker compose --profile events up
@@ -182,25 +147,28 @@ services:
      retries: 5
  mlflow:
-    image: ghcr.io/mlflow/mlflow:v2.14.3
+    image: ghcr.io/mlflow/mlflow:v3.11.1
    profiles: [mlops]
    command: >
      mlflow server
      --backend-store-uri sqlite:////mlflow/mlflow.db
-      --default-artifact-root /mlflow/artifacts
+      --artifacts-destination /mlflow/artifacts
      --serve-artifacts
      --default-artifact-root mlflow-artifacts:/
      --host 0.0.0.0
      --port 5000
      --app-name basic-auth
      --static-prefix /mlflow
-    environment:
-      MLFLOW_AUTH_CONFIG_PATH: /mlflow/basic_auth.ini
+      --allowed-hosts o.alogins.net,localhost,localhost:5000,mlflow,mlflow:5000
+      --cors-allowed-origins https://o.alogins.net
    volumes:
      - /mnt/ssd/dbs/oo/mlflow:/mlflow
      - ../../infra/mlflow/basic_auth.ini:/mlflow/basic_auth.ini:ro
    ports:
      - "127.0.0.1:5000:5000"
    healthcheck:
-      test: ["CMD", "python", "-c", "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://localhost:5000/health',timeout=3).status==200 else 1)"]
+      test: ["CMD", "python", "-c", "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://localhost:5000/mlflow/health',timeout=3).status==200 else 1)"]
      interval: 10s
      timeout: 5s
      retries: 5
 volumes:
  ollama-models:
--- a/ml/README.md
+++ b/ml/README.md
@@ -4,9 +4,9 @@ Python. Owns models, features, training, online scoring.
 | Dir | Role | Phase |
 |---|---|---|
-| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`), called by `recommender` | 1–2 |
+| `serving/` | FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`) + JetStream consumers for `signals.>` / `feedback.>`, called by `recommender` | 1–2 |
-| `features/` | context assembler (`context.py`): signals → `PromptContext`; Feast adapter later | 2 |
+| `features/` | context assembler (`context.py`): signals → `PromptContext`; profile-feature schema mirror (`profile_schema.py`); Feast adapter later | 2 |
-| `pipelines/` | batch feature + training DAGs (Prefect/Airflow) | 4 |
+| `pipelines/` | batch feature + training scripts | 4 |
 | `registry/` | MLflow-backed model registry integration | 4 |
 | `experiments/` | A/B assignment + multi-armed bandit policies | 4 |
 | `notebooks/` | research; never imported by production code | — |
@@ -18,14 +18,24 @@ Python. Owns models, features, training, online scoring.
 - Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew).
 - Shadow deploys before any policy change that affects real users.
-## Profile-feature contract
+## Feature contract
 ### Profile features (batched)
 User-level features (completion rate, preferred hour, tip volume…) are computed
-by the TypeScript recommender and shipped to ml/serving on every `/score` and
+by the TypeScript recommender and shipped to `ml/serving` on every `/score` and
 `/generate` call as `profile_features: dict | None`. The Python mirror in
-`features/profile_schema.py` documents the available names + dtypes — keep it
-in sync with `services/api/src/profile/registry.ts` (a CI-style test asserts
-the name sets match). See ADR-0011.
+`features/profile_schema.py` documents each feature's name, dtype, TTL, source,
+and null fallback — keep it in sync with `services/api/src/profile/registry.ts`
+(a CI-style test asserts names and `ttlSec` values match). See ADR-0011.
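As a rough illustration of the parity check described above, the sketch below compares a Python mirror against a `{name: ttlSec}` mapping exported from the TypeScript registry. `ProfileFeature`, `PROFILE_FEATURES`, `parity_errors`, and the feature names, TTLs, and sources are hypothetical placeholders, not the actual contents of `features/profile_schema.py`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileFeature:
    """One mirrored feature definition (hypothetical shape)."""
    name: str
    dtype: str
    ttl_sec: int
    source: str
    null_fallback: float | None


# Placeholder entries for illustration only; the real names and TTLs are assumptions.
PROFILE_FEATURES: dict[str, ProfileFeature] = {
    f.name: f
    for f in (
        ProfileFeature("completion_rate", "float", 3600, "feedback", None),
        ProfileFeature("preferred_hour", "int", 86_400, "feedback", None),
    )
}


def parity_errors(ts_registry: dict[str, int]) -> list[str]:
    """Return mismatches between the Python mirror and a {name: ttlSec} mapping."""
    errors: list[str] = []
    py_names, ts_names = set(PROFILE_FEATURES), set(ts_registry)
    for name in sorted(py_names - ts_names):
        errors.append(f"missing in TS: {name}")
    for name in sorted(ts_names - py_names):
        errors.append(f"missing in Python: {name}")
    for name in sorted(py_names & ts_names):
        if PROFILE_FEATURES[name].ttl_sec != ts_registry[name]:
            errors.append(f"ttl mismatch: {name}")
    return errors
```

A test could then assert `parity_errors(exported_registry) == []`, where `exported_registry` is whatever mapping the TypeScript side emits.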
 ### Context features (JIT)
 Request-time signals assembled by `features/context.py` (`hour_of_day`,
 `day_of_week`, task list). These are never cached — they are derived from the
 system clock and the live Todoist feed at the moment of the score call.
 `CONTEXT_FEATURES` in `context.py` declares freshness, source, and fallback for
 each field (issue #61).
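A minimal sketch of how the clock-derived context features might be computed at request time. The function name `clock_features`, the use of UTC, and the Monday-is-0 convention for `day_of_week` are assumptions for illustration; `features/context.py` may differ.

```python
from __future__ import annotations

from datetime import datetime, timezone


def clock_features(now: datetime | None = None) -> dict[str, int]:
    """Derive hour_of_day and day_of_week from the clock; nothing is cached."""
    # Assumption: UTC when no timestamp is supplied.
    now = now or datetime.now(timezone.utc)
    # Assumption: day_of_week follows datetime.weekday(), so Monday is 0.
    return {"hour_of_day": now.hour, "day_of_week": now.weekday()}
```

Passing `now` explicitly keeps the function deterministic in tests while production callers rely on the live clock.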
 ## Prompt registry
--- a/ml/__init__.py
+++ b/ml/__init__.py
--- a/ml/agents/__init__.py
+++ b/ml/agents/__init__.py
@@ -0,0 +1,4 @@
 from .base import BaseAgent, AgentInput, AgentOutput
 from .registry import get_agent, all_agents
 __all__ = ["BaseAgent", "AgentInput", "AgentOutput", "get_agent", "all_agents"]
--- a/ml/agents/base.py
+++ b/ml/agents/base.py
@@ -0,0 +1,61 @@
 """Base class and shared data structures for all recommendation sub-agents."""
 from __future__ import annotations
 from abc import ABC, abstractmethod
 from dataclasses import dataclass, field
 from datetime import datetime, timedelta, timezone
 from typing import ClassVar
 @dataclass
 class AgentInput:
    """Everything an agent may need to produce its prompt snippet."""
    user_id: str
    tasks: list[dict]                          # task signal dicts (content, priority, is_overdue, …)
    profile: dict[str, float | None]           # profile feature values keyed by feature name
    feedback_history: list[dict] = field(default_factory=list)  # [{action, dwell_ms, created_at}, …]
    now: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Per-agent inferred/user prefs loaded from user_preferences (ADR-0014 §3).
    # Keys match the agent's pref_schema + inferred_params. 'user' source takes
    # precedence over 'inferred' source; the caller resolves priority before
    # passing this dict in.
    agent_prefs: dict = field(default_factory=dict)
    # Pre-fetched enrichment cache: {content_hash -> description}. Populated by
    # the TS caller from the task_enrichments DB table to avoid redundant LLM calls.
    enrichment_cache: dict = field(default_factory=dict)
 @dataclass
 class AgentOutput:
    """Result produced by an agent; persisted to agent_outputs table."""
    user_id: str
    agent_id: str
    prompt_text: str                           # snippet passed to the orchestrator
    signals_snapshot: dict                     # inputs consumed (for explainability / debugging)
    computed_at: str                           # ISO 8601
    expires_at: str                            # ISO 8601
    agent_version: str
 class BaseAgent(ABC):
    agent_id: ClassVar[str]
    ttl_seconds: ClassVar[int]
    version: ClassVar[str]
    @abstractmethod
    def compute(self, inp: AgentInput) -> AgentOutput:
        """Analyse inp and return a prompt snippet describing what was found."""
        ...
    def _make_output(self, inp: AgentInput, prompt_text: str, snapshot: dict) -> AgentOutput:
        computed_at = inp.now.astimezone(timezone.utc).isoformat()
        expires_at = (inp.now.astimezone(timezone.utc) + timedelta(seconds=self.ttl_seconds)).isoformat()
        return AgentOutput(
            user_id=inp.user_id,
            agent_id=self.agent_id,
            prompt_text=prompt_text,
            signals_snapshot=snapshot,
            computed_at=computed_at,
            expires_at=expires_at,
            agent_version=self.version,
        )
--- a/ml/agents/clustering.py
+++ b/ml/agents/clustering.py
@@ -0,0 +1,290 @@
 """Semantic task clustering via nomic-embed-text (issue #97, #129).
 Public API:
    cluster_tasks(tasks, enrichment_cache=None) -> tuple[list[Cluster], dict[str, str]]
 Each task dict must have a "content" key. Tasks without content are placed in a
 fallback "other" bucket. If the embedding service is unreachable, falls back to
 grouping by project_id so compute() always returns something useful.
 Pipeline (ported from taskpile experiments/clustering_eval, prompt v1):
  1. Expand each raw title via LiteLLM `tip-generator` (qwen2.5:1.5b) into a
     3-sentence description. Cached in-memory by content hash within a compute
     cycle so duplicate titles cost one LLM call.
  2. Prefix the expanded text with "clustering: " (nomic-embed-text task prefix).
  3. Batch-embed via LiteLLM `embedder` (nomic-embed-text).
  Falls back to embedding raw titles when LLM expansion fails, and to
  project-based grouping when embeddings are unavailable.
 """
 from __future__ import annotations
 import hashlib
 import logging
 import math
 import os
 from dataclasses import dataclass, field
 import httpx
 log = logging.getLogger(__name__)
 # Cosine similarity threshold for merging tasks into the same cluster.
 _SIM_THRESHOLD = 0.72
 # Never produce more than this many clusters regardless of task count.
 _MAX_CLUSTERS = 6
 _EMBED_TIMEOUT = 15.0
 _ENRICH_TIMEOUT = 30.0
 _ENRICH_PROMPT_V1 = (
    "You are helping categorize a personal task. "
    "Write exactly 3 sentences in English describing what the task likely involves, "
    "what context or skills it needs, and why it might matter. "
    "Be concise and specific. Do not use bullet points or numbering.\n"
    "Task: {title}\n"
    "Description:"
 )
 @dataclass
 class Cluster:
    label: str                   # representative task content (shortest, most central)
    tasks: list[dict] = field(default_factory=list)
    @property
    def task_count(self) -> int:
        return len(self.tasks)
    @property
    def overdue_count(self) -> int:
        return sum(1 for t in self.tasks if t.get("is_overdue"))
 # ---------------------------------------------------------------------------
 # LLM enrichment
 # ---------------------------------------------------------------------------
 def _content_hash(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()
 def _enrich_title(title: str, litellm_url: str) -> str | None:
    """Expand a terse task title into a 3-sentence description via LiteLLM."""
    try:
        with httpx.Client(trust_env=False, timeout=_ENRICH_TIMEOUT) as c:
            r = c.post(
                f"{litellm_url}/chat/completions",
                json={
                    "model": "tip-generator",
                    "messages": [{"role": "user", "content": _ENRICH_PROMPT_V1.format(title=title)}],
                    "max_tokens": 120,
                    "temperature": 0.3,
                },
            )
            r.raise_for_status()
            return r.json()["choices"][0]["message"]["content"].strip()
    except Exception as exc:
        log.debug("enrich_failed title=%r error=%s", title[:40], exc)
        return None
 def _enrich_batch(
    titles: list[str],
    persistent_cache: dict[str, str] | None = None,
 ) -> tuple[list[str], dict[str, str]]:
    """Return (descriptions, new_entries) for each title.
    Checks persistent_cache (pre-fetched from DB) first, then falls back to
    calling LiteLLM. new_entries contains only hashes generated this call —
    the caller should persist these to the DB.
    """
    litellm_url = os.getenv("LITELLM_URL")
    if not litellm_url:
        log.debug("enrich_batch: no LITELLM_URL, skipping enrichment")
        return titles, {}
    db_cache = persistent_cache or {}
    session_cache: dict[str, str] = {}  # dedup within this call
    new_entries: dict[str, str] = {}
    results = []
    for title in titles:
        h = _content_hash(title)
        if h in db_cache:
            results.append(db_cache[h])
        elif h in session_cache:
            results.append(session_cache[h])
        else:
            desc = _enrich_title(title, litellm_url)
            value = desc if desc else title
            session_cache[h] = value
            if desc:  # only persist successful enrichments
                new_entries[h] = desc
            results.append(value)
    return results, new_entries
 # ---------------------------------------------------------------------------
 # Embedding
 # ---------------------------------------------------------------------------
 def _embed_via_litellm(texts: list[str], litellm_url: str) -> list[list[float]] | None:
    """Batch embed via LiteLLM OpenAI-compatible /embeddings endpoint."""
    try:
        with httpx.Client(trust_env=False, timeout=_EMBED_TIMEOUT) as c:
            r = c.post(
                f"{litellm_url}/embeddings",
                json={"model": "embedder", "input": texts},
            )
            r.raise_for_status()
            data = r.json().get("data", [])
            ordered = sorted(data, key=lambda x: x["index"])
            return [item["embedding"] for item in ordered]
    except Exception as exc:
        log.debug("litellm_embed_failed error=%s", exc)
        return None
 def _embed_via_ollama(texts: list[str], ollama_url: str) -> list[list[float]] | None:
    """Embed texts via the Ollama /api/embed endpoint, one request per text."""
    try:
        results = []
        with httpx.Client(trust_env=False, timeout=_EMBED_TIMEOUT) as c:
            for text in texts:
                r = c.post(
                    f"{ollama_url}/api/embed",
                    json={"model": "nomic-embed-text", "input": text},
                )
                r.raise_for_status()
                body = r.json()
                # /api/embed returns {"embeddings": [[...]]}
                embeddings = body.get("embeddings")
                if not embeddings:
                    return None
                results.append(embeddings[0])
        return results
    except Exception as exc:
        log.debug("ollama_embed_failed error=%s", exc)
        return None
 def _embed_batch(texts: list[str]) -> list[list[float]] | None:
    """Embed a list of texts, preferring LiteLLM over direct Ollama."""
    litellm_url = os.getenv("LITELLM_URL")
    if litellm_url:
        vecs = _embed_via_litellm(texts, litellm_url)
        if vecs is not None:
            return vecs
        log.info("cluster: litellm embed failed, trying ollama fallback")
    ollama_url = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
    return _embed_via_ollama(texts, ollama_url)
 # ---------------------------------------------------------------------------
 # Clustering
 # ---------------------------------------------------------------------------
 def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
 def _greedy_cluster(items: list[tuple[dict, list[float]]]) -> list[Cluster]:
    """Single-pass greedy clustering: each item joins the most similar existing
    cluster whose centroid similarity reaches _SIM_THRESHOLD, else starts a new
    one. Once _MAX_CLUSTERS exist, unmatched items join the closest cluster."""
    clusters: list[tuple[list[float], Cluster]] = []  # (centroid, cluster)
    for task, vec in items:
        best_idx = -1
        best_sim = _SIM_THRESHOLD - 1e-9
        for i, (centroid, _) in enumerate(clusters):
            sim = _cosine(centroid, vec)
            if sim > best_sim:
                best_sim = sim
                best_idx = i
        # Join the best match even at the cluster cap so its centroid stays current.
        if best_idx >= 0:
            centroid, cluster = clusters[best_idx]
            cluster.tasks.append(task)
            # Update centroid as running mean.
            n = len(cluster.tasks)
            new_centroid = [(c * (n - 1) + v) / n for c, v in zip(centroid, vec)]
            clusters[best_idx] = (new_centroid, cluster)
        elif len(clusters) < _MAX_CLUSTERS:
            label = task.get("content", "Tasks")[:60]
            cluster = Cluster(label=label, tasks=[task])
            clusters.append((vec, cluster))
        else:
            # Overflow: append to closest cluster even below threshold.
            best_i = max(range(len(clusters)), key=lambda i: _cosine(clusters[i][0], vec))
            clusters[best_i][1].tasks.append(task)
    return [c for _, c in clusters]
 def _fallback_by_project(tasks: list[dict]) -> list[Cluster]:
    """Group by project_id when embeddings are unavailable."""
    buckets: dict[str, Cluster] = {}
    for task in tasks:
        pid = task.get("project_id") or task.get("project") or "default"
        if pid not in buckets:
            label = pid if pid != "default" else "Tasks"
            buckets[pid] = Cluster(label=label)
        buckets[pid].tasks.append(task)
    return list(buckets.values())
 def cluster_tasks(
    tasks: list[dict],
    ollama_url: str | None = None,  # kept for test compatibility; env vars take precedence
    enrichment_cache: dict[str, str] | None = None,
 ) -> tuple[list[Cluster], dict[str, str]]:
    """Cluster tasks by semantic similarity.
    Returns (clusters, new_enrichments). new_enrichments contains LLM-generated
    descriptions produced this call that were not in the persistent cache — the
    caller should persist these. Falls back to project-based grouping if the
    embedding service is unavailable or tasks have no content.
    """
    if not tasks:
        return [], {}
    # Separate tasks with usable content from those without.
    # `or ""` guards against tasks whose "content" key is present but None.
    with_content = [(t, (t.get("content") or "").strip()) for t in tasks]
    embeddable = [(t, c) for t, c in with_content if c]
    no_content = [t for t, c in with_content if not c]
    if not embeddable:
        return _fallback_by_project(tasks), {}
    task_objs = [t for t, _ in embeddable]
    raw_titles = [c for _, c in embeddable]
    # Step 1: LLM-enrich titles → richer semantic signal before embedding.
    descriptions, new_enrichments = _enrich_batch(raw_titles, persistent_cache=enrichment_cache)
    # Attach enriched description to each task dict so consumers (e.g. focus-area)
    # can show the expanded text instead of the terse raw title.
    for task, desc in zip(task_objs, descriptions):
        task["enriched_description"] = desc
    # Step 2: Prefix with nomic-embed-text task prefix, then batch-embed.
    prefixed = [f"clustering: {d}" for d in descriptions]
    vecs = _embed_batch(prefixed)
    if vecs is None or len(vecs) != len(prefixed):
        log.info("cluster_tasks: embedding unavailable, falling back to project grouping")
        return _fallback_by_project(tasks), new_enrichments
    embedded = list(zip(task_objs, vecs))
    clusters = _greedy_cluster(embedded)
    if no_content:
        clusters.append(Cluster(label="Other tasks", tasks=no_content))
    return clusters, new_enrichments
--- a/ml/agents/focus_area.py
+++ b/ml/agents/focus_area.py
@@ -0,0 +1,70 @@
 from __future__ import annotations
 from typing import ClassVar
 from .base import BaseAgent, AgentInput, AgentOutput
 from .clustering import cluster_tasks
 from .manifest import AgentManifest
 MANIFEST = AgentManifest(
    id="focus-area",
    version="3.0.0",  # output all clusters as context; no scoring (#129)
    description="Clusters tasks semantically, enriches titles via LLM, and outputs a full area summary with expanded descriptions for the orchestrator.",
    pref_schema={"type": "object", "additionalProperties": False, "properties": {}},
    context_schema=["todoist.tasks"],
    required_consents=["data:core", "data:todoist"],
    output_contract={"type": "snippet", "format": "free_text"},
    ttl_sec=86_400,
    inferred_params=[],
 )
 class FocusAreaAgent(BaseAgent):
    """Clusters tasks and outputs a full area summary for the orchestrator."""
    agent_id: ClassVar[str] = MANIFEST.id
    ttl_seconds: ClassVar[int] = MANIFEST.ttl_sec
    version: ClassVar[str] = MANIFEST.version  # 3.0.0
    def compute(self, inp: AgentInput) -> AgentOutput:
        if not inp.tasks:
            return self._make_output(
                inp,
                "No tasks available to identify focus areas.",
                {"cluster_count": 0},
            )
        clusters, new_enrichments = cluster_tasks(inp.tasks, enrichment_cache=inp.enrichment_cache)
        if not clusters:
            return self._make_output(
                inp,
                "No tasks available to identify focus areas.",
                {"cluster_count": 0},
            )
        lines = [f"The user's tasks are grouped into {len(clusters)} area(s):"]
        for i, cluster in enumerate(clusters, 1):
            descs = [
                t.get("enriched_description") or t.get("content", "")
                for t in cluster.tasks
                if t.get("content")
            ]
            descs = [d.strip() for d in descs if d.strip()]
            descs_str = "; ".join(f'"{d}"' for d in descs[:8])
            if len(descs) > 8:
                descs_str += f" (and {len(descs) - 8} more)"
            lines.append(f"{i}. {cluster.label} — {cluster.task_count} task(s): {descs_str}")
        lines.append("(Task titles may be in any language — always write the tip in English.)")
        snapshot = {
            "cluster_count": len(clusters),
            "clusters": [
                {"label": c.label, "task_count": c.task_count,
                 "tasks": [t.get("content", "") for t in c.tasks]}
                for c in clusters
            ],
            "_new_enrichments": new_enrichments,
        }
        return self._make_output(inp, "\n".join(lines), snapshot)
--- a/ml/agents/health_vitals.py
+++ b/ml/agents/health_vitals.py
@@ -0,0 +1,134 @@
 from __future__ import annotations
 from typing import ClassVar
 from .base import BaseAgent, AgentInput, AgentOutput
 from .manifest import AgentManifest, InferredParam
 from .inference.history import UserHistory
 def _infer_step_goal(history: UserHistory) -> int:
    """Return the step-goal baseline; currently a static 7,000-step placeholder.
    UserHistory carries no step data yet, so nothing is inferred from it. A
    median-based goal needs step history, which arrives via
    agent_prefs.step_history when available.
    """
    return 7_000
 MANIFEST = AgentManifest(
    id="health-vitals",
    version="1.0.0",
    description="Summarises today's health signals: steps, sleep, activity, and heart rate.",
    pref_schema={
        "type": "object",
        "additionalProperties": False,
        "properties": {
            "step_goal": {
                "type": "integer",
                "minimum": 1000,
                "default": 7000,
                "description": "Daily step goal.",
            },
            "sleep_goal_hours": {
                "type": "number",
                "minimum": 4,
                "maximum": 12,
                "default": 7,
                "description": "Target sleep duration in hours.",
            },
        },
    },
    context_schema=["google-health.steps", "google-health.sleep", "google-health.activity", "google-health.heart_rate"],
    required_consents=["data:core", "data:google-health"],
    output_contract={"type": "snippet", "format": "free_text"},
    ttl_sec=1800,  # refresh every 30 min — health data changes during the day
    silenced_in_contexts=[],
    inferred_params=[
        InferredParam(
            key="step_goal",
            ttl_sec=7 * 86_400,
            cold_start_default=7000,
            min_history=0,
            infer=lambda h: 7000,  # static default; override via user pref
        ),
    ],
 )
 class HealthVitalsAgent(BaseAgent):
    """Summarises today's health signals into an orchestrator prompt snippet."""
    agent_id: ClassVar[str] = MANIFEST.id
    ttl_seconds: ClassVar[int] = MANIFEST.ttl_sec
    version: ClassVar[str] = MANIFEST.version
    def compute(self, inp: AgentInput) -> AgentOutput:
        step_goal = int(inp.agent_prefs.get("step_goal", 7000))
        sleep_goal = float(inp.agent_prefs.get("sleep_goal_hours", 7.0))
        health = [t for t in inp.tasks if t.get("source") == "google-health"]
        if not health:
            prompt = "No health data available from Google Health today. (Always write the tip in English.)"
            return self._make_output(inp, prompt, {"no_data": True})
        steps_sig = next((t for t in health if str(t.get("id", "")).endswith(":steps")), None)
        sleep_sig = next((t for t in health if str(t.get("id", "")).endswith(":sleep")), None)
        activity_sig = next((t for t in health if str(t.get("id", "")).endswith(":activity")), None)
        hr_sig = next((t for t in health if str(t.get("id", "")).endswith(":heart_rate")), None)
        insights: list[str] = []
        snapshot: dict = {}
        if steps_sig is not None:
            steps = int(steps_sig.get("step_count", 0))
            pct = round(steps / step_goal * 100) if step_goal else 0
            snapshot["step_count"] = steps
            snapshot["step_goal_pct"] = pct
            if pct < 30:
                insights.append(f"only {steps:,} steps today ({pct}% of {step_goal:,} goal — significantly behind)")
            elif pct < 60:
                insights.append(f"{steps:,} steps today ({pct}% of {step_goal:,} goal)")
            elif pct >= 100:
                insights.append(f"{steps:,} steps today (daily goal reached!)")
            else:
                insights.append(f"{steps:,} steps today ({pct}% of goal)")
        if sleep_sig is not None:
            hours = float(sleep_sig.get("sleep_hours", 0))
            deficit = max(0.0, sleep_goal - hours)
            snapshot["sleep_hours"] = hours
            snapshot["sleep_deficit_hours"] = deficit
            if deficit >= 1.5:
                insights.append(f"only {hours:.1f}h sleep last night ({deficit:.1f}h below the {sleep_goal:.0f}h goal)")
            elif deficit > 0:
                insights.append(f"{hours:.1f}h sleep last night (slightly below {sleep_goal:.0f}h goal)")
            else:
                insights.append(f"{hours:.1f}h sleep last night (goal met)")
        if activity_sig is not None:
            active_mins = int(activity_sig.get("active_minutes", 0))
            calories = int(activity_sig.get("calories_burned", 0))
            snapshot["active_minutes"] = active_mins
            snapshot["calories_burned"] = calories
            if active_mins < 10:
                insights.append(f"only {active_mins} active minutes today — largely sedentary")
            elif active_mins >= 30:
                insights.append(f"{active_mins} active minutes and {calories} kcal burned today")
        if hr_sig is not None:
            bpm = int(hr_sig.get("resting_bpm", 0))
            snapshot["resting_bpm"] = bpm
            if bpm > 90:
                insights.append(f"elevated resting heart rate: {bpm} bpm")
            elif bpm > 0:
                insights.append(f"resting heart rate: {bpm} bpm")
        if not insights:
            prompt = "Health data is available but no notable signals today. (Always write the tip in English.)"
        else:
            body = "; ".join(insights)
            prompt = f"Health snapshot: {body}. (Always write the tip in English.)"
        return self._make_output(inp, prompt, snapshot)
--- a/ml/agents/inference/__init__.py
+++ b/ml/agents/inference/__init__.py
@@ -0,0 +1,9 @@
 """Shared context-inference framework (ADR-0014 §3, issue #111).
 Each agent's manifest declares InferredParams; this package owns the
 scheduling contract, history data model, and write path to user_preferences.
 """
 from .framework import run_inference
 from .history import FeedbackEvent, TaskCompletion, UserHistory
 __all__ = ["run_inference", "FeedbackEvent", "TaskCompletion", "UserHistory"]
--- a/ml/agents/inference/framework.py
+++ b/ml/agents/inference/framework.py
@@ -0,0 +1,59 @@
 """run_inference — core of the context-inference framework (ADR-0014 §3).
 Contract:
    run_inference(manifest, history) → dict[key, value]
 Semantics:
  - For each InferredParam in manifest.inferred_params:
      - If param.infer is None or len(history.events) < param.min_history
        → emit cold_start_default.
      - Otherwise → call param.infer(history) and emit the result; if infer
        raises, log a warning and emit cold_start_default.
  - Returns {key: value} ready for the caller to persist to user_preferences
    with source='inferred'.
  - User overrides (source='user') are handled by the caller's upsert logic;
    this function has no DB access.
 """
 from __future__ import annotations
 import logging
 import time
 from typing import Any
 from ..manifest import AgentManifest
 from .history import UserHistory
 log = logging.getLogger(__name__)
 def run_inference(manifest: AgentManifest, history: UserHistory) -> dict[str, Any]:
    """Evaluate all InferredParams for an agent and return {key: inferred_value}."""
    result: dict[str, Any] = {}
    n = len(history.events)
    for param in manifest.inferred_params:
        t0 = time.monotonic()
        if param.infer is None:
            result[param.key] = param.cold_start_default
            continue
        if n < param.min_history:
            value = param.cold_start_default
            source = "cold_start"
        else:
            try:
                value = param.infer(history)
                source = "inferred"
            except Exception as exc:
                log.warning(
                    "inference_error agent=%s param=%s error=%s — using cold_start_default",
                    manifest.id, param.key, exc,
                )
                value = param.cold_start_default
                source = "error_fallback"
        latency_ms = round((time.monotonic() - t0) * 1000, 1)
        log.info(
            "inference_param agent=%s param=%s source=%s value=%r history_len=%d latency_ms=%s",
            manifest.id, param.key, source, value, n, latency_ms,
        )
        result[param.key] = value
    return result
--- a/ml/agents/inference/history.py
+++ b/ml/agents/inference/history.py
@@ -0,0 +1,49 @@
 """UserHistory — normalised view of a user's feedback events for inference."""
 from __future__ import annotations
 from dataclasses import dataclass, field
 from datetime import datetime, timezone
 @dataclass
 class FeedbackEvent:
    action: str              # 'done' | 'dismiss' | 'snooze' | 'helpful' | 'not_helpful'
    dwell_ms: int | None
    created_at: str          # ISO 8601
    @property
    def hour(self) -> int:
        """Hour of day (0-23) when the feedback was recorded."""
        try:
            dt = datetime.fromisoformat(self.created_at.replace("Z", "+00:00"))
        except ValueError:
            return 12
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.hour
 @dataclass
 class TaskCompletion:
    """A completed task that had a due date — used for lateness inference."""
    project_id: str | None
    completed_at: str   # ISO 8601
    due_at: str         # ISO 8601
    @property
    def lateness_days(self) -> float:
        """Days between due_at and completed_at. Negative = completed early."""
        try:
            def _parse(s: str) -> datetime:
                dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
                return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
            return (_parse(self.completed_at) - _parse(self.due_at)).total_seconds() / 86_400
        except ValueError:
            return 0.0
 @dataclass
 class UserHistory:
    user_id: str
    events: list[FeedbackEvent] = field(default_factory=list)
    task_completions: list[TaskCompletion] = field(default_factory=list)
--- a/ml/agents/manifest.py
+++ b/ml/agents/manifest.py
@@ -0,0 +1,70 @@
 """Agent manifest dataclass (ADR-0014).
 A manifest is the single point of registration for an agent. The orchestrator,
 admin UI, registry endpoint, and inference framework all read from it. Adding
 an agent is adding a manifest + agent class — never editing a list elsewhere.
 The manifest lives next to the agent code (each agent module in ml/agents/
 exposes a module-level `MANIFEST` constant). The registry surfaces both the
 agent instance and its manifest.
 """
 from __future__ import annotations
 from dataclasses import dataclass, field
 from typing import Any, Callable
@dataclass(frozen=True)
 class InferredParam:
    """One auto-inferred preference key (#111-#116).
    The inference framework owns scheduling, history reads, persistence, and
    cold-start. Each agent's `inferred_params` list declares what to infer and
    how, leaving each agent to implement just `infer()`.
    """
    key: str                                  # e.g. 'quietStart'
    ttl_sec: int                              # how often to recompute
    cold_start_default: Any                   # value used until min_history is met
    min_history: int                          # event count threshold
    # Pure function: given a UserHistory snapshot, return the inferred value.
    # Typed as a generic callable here; concrete signature lives in the framework.
    infer: Callable[[Any], Any] | None = None
@dataclass(frozen=True)
 class AgentManifest:
    """Declarative description of an agent — see ADR-0014 §1."""
    id: str                                   # 'time-of-day'
    version: str                              # bump invalidates cached outputs + inferences
    description: str                          # one-line human summary for admin UI
    pref_schema: dict                         # JSON Schema for user-tunable knobs
    context_schema: list[str]                 # signals it reads, e.g. ['todoist.tasks']
    required_consents: list[str]              # ['data:todoist', 'agent:time-of-day']
    output_contract: dict                     # snippet shape (free text + optional tags)
    ttl_sec: int                              # snippet freshness for agent_outputs
    silenced_in_contexts: list[str] = field(default_factory=list)  # active context names that suppress this agent
    inferred_params: list[InferredParam] = field(default_factory=list)
    def to_dict(self) -> dict:
        """Serialise for the registry endpoint. `inferred_params` drops `infer`
        (callable) since the wire format only carries metadata."""
        return {
            "id": self.id,
            "version": self.version,
            "description": self.description,
            "pref_schema": self.pref_schema,
            "context_schema": self.context_schema,
            "required_consents": self.required_consents,
            "output_contract": self.output_contract,
            "ttl_sec": self.ttl_sec,
            "silenced_in_contexts": list(self.silenced_in_contexts),
            "inferred_params": [
                {
                    "key": p.key,
                    "ttl_sec": p.ttl_sec,
                    "cold_start_default": p.cold_start_default,
                    "min_history": p.min_history,
                }
                for p in self.inferred_params
            ],
        }
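# Illustrative sketch (not part of this module): declaring an inferred parameter,
# resolving it with a cold-start fallback, and serialising it without its
# callable, as to_dict() does above. `ParamSketch`, `infer_quiet_start`, and
# `resolve` are hypothetical stand-ins; the fallback rule is an assumption drawn
# from the field comments, and the real framework may behave differently.
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ParamSketch:
    key: str
    ttl_sec: int
    cold_start_default: Any
    min_history: int
    infer: Optional[Callable[[Any], Any]] = None


def infer_quiet_start(history: Any) -> int:
    # Placeholder inference: a real implementation would read the history.
    return 22


def resolve(param: ParamSketch, history: list) -> Any:
    """Use the cold-start default until enough history has accumulated."""
    if param.infer is None or len(history) < param.min_history:
        return param.cold_start_default
    return param.infer(history)


param = ParamSketch("quietStart", 86_400, 21, min_history=3, infer=infer_quiet_start)
# Wire format carries metadata only; the callable is left out.
wire = {k: getattr(param, k) for k in ("key", "ttl_sec", "cold_start_default", "min_history")}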
--- a/ml/agents/momentum.py
+++ b/ml/agents/momentum.py
@@ -0,0 +1,249 @@
 from __future__ import annotations
 import math
 import statistics
 from collections import defaultdict
 from datetime import datetime, timedelta, timezone
 from typing import ClassVar
 from .base import BaseAgent, AgentInput, AgentOutput
 from .inference.history import UserHistory
 from .manifest import AgentManifest, InferredParam
 def _parse_dt(iso: str) -> datetime:
    try:
        dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt
    except ValueError:
        return datetime.min.replace(tzinfo=timezone.utc)
 def _daily_done_counts(history: UserHistory, window_days: int = 28) -> list[int]:
    """Count done-action events per calendar day over the last window_days days."""
    if not history.events:
        return []
    latest = max(_parse_dt(e.created_at) for e in history.events)
    cutoff = latest - timedelta(days=window_days)
    by_day: dict[tuple[int, int, int], int] = defaultdict(int)
    for e in history.events:
        if e.action == "done":
            dt = _parse_dt(e.created_at)
            if dt >= cutoff:
                by_day[(dt.year, dt.month, dt.day)] += 1
    # Return counts for every day in the window, including zero-completion days.
    counts = []
    for offset in range(window_days):
        day = (latest - timedelta(days=offset)).date()
        counts.append(by_day.get((day.year, day.month, day.day), 0))
    return counts
 def _infer_baseline_completions_per_day(history: UserHistory) -> float:
    counts = _daily_done_counts(history)
    return statistics.mean(counts) if counts else 1.0
 def _infer_stdev(history: UserHistory) -> float:
    counts = _daily_done_counts(history)
    if len(counts) < 2:
        return 1.0
    sd = statistics.stdev(counts)
    return max(sd, 0.1)  # floor so we never divide by zero in z-score
 def _infer_engagement_trend(history: UserHistory) -> str:
    """Compare done-rate in the most recent 7 days vs the 7 days before that."""
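    events = history.events
    if not events:
        return "stable"
    # Use _parse_dt so `latest` is always timezone-aware; a naive datetime here
    # would raise TypeError when compared against the parsed events below.
    latest = max(_parse_dt(e.created_at) for e in events)
    if latest == datetime.min.replace(tzinfo=timezone.utc):
        return "stable"  # no parseable timestamps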
    cutoff_recent = latest - timedelta(days=7)
    cutoff_older = latest - timedelta(days=14)
    recent = [e for e in events if _parse_dt(e.created_at) >= cutoff_recent]
    older = [e for e in events if cutoff_older <= _parse_dt(e.created_at) < cutoff_recent]
    if len(older) < 3:
        return "stable"
    recent_rate = sum(1 for e in recent if e.action == "done") / max(len(recent), 1)
    older_rate = sum(1 for e in older if e.action == "done") / max(len(older), 1)
    delta = recent_rate - older_rate
    if delta > 0.10:
        return "up"
    if delta < -0.10:
        return "down"
    return "stable"
 MANIFEST = AgentManifest(
    id="momentum",
    version="1.2.0",  # #114: baseline + stdev inferred params; z-score snippet language
    description="Characterises the user's recent engagement trend from profile features.",
    pref_schema={
        "type": "object",
        "additionalProperties": False,
        "properties": {
            "low_engagement_threshold_pct": {
                "type": "integer",
                "minimum": 0,
                "maximum": 100,
                "default": 25,
                "description": "Completion rate below which momentum hints at low engagement.",
            },
            "baseline_completions_per_day": {
                "type": "number",
                "minimum": 0,
                "default": 1.0,
                "description": "User's normal daily done-task rate (inferred from 28d history).",
            },
            "stdev": {
                "type": "number",
                "minimum": 0,
                "default": 1.0,
                "description": "Stdev of daily completion counts; used for z-score normalisation.",
            },
            "momentum_window": {
                "type": "integer",
                "minimum": 1,
                "default": 7,
                "description": "Days of recent history to measure current momentum against baseline.",
            },
        },
    },
    context_schema=["profile.features"],
    required_consents=["data:core"],
    output_contract={"type": "snippet", "format": "free_text"},
    ttl_sec=21_600,
    inferred_params=[
        InferredParam(
            key="engagement_trend",
            ttl_sec=21_600,
            cold_start_default="stable",
            min_history=10,
            infer=_infer_engagement_trend,
        ),
        InferredParam(
            key="baseline_completions_per_day",
            ttl_sec=7 * 86_400,
            cold_start_default=1.0,
            min_history=14,
            infer=_infer_baseline_completions_per_day,
        ),
        InferredParam(
            key="stdev",
            ttl_sec=7 * 86_400,
            cold_start_default=1.0,
            min_history=14,
            infer=_infer_stdev,
        ),
    ],
 )
 def _z_score_label(z: float) -> str | None:
    """Map z-score to a human-readable momentum label, or None if within normal range."""
    if z >= 2.0:
        return "well above your usual pace"
    if z >= 1.0:
        return "above your usual pace"
    if z <= -2.0:
        return "well below your usual pace"
    if z <= -1.0:
        return "below your usual pace"
    return None
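# Illustrative sketch (not part of this module): the z-score arithmetic used by
# MomentumAgent.compute, on made-up numbers. `pace_z_score` is a hypothetical
# helper; the 0.1 floor mirrors the guard in _infer_stdev.
def pace_z_score(recent_done: int, window_days: int, baseline: float, stdev: float) -> float:
    recent_rate = recent_done / window_days  # completions per day
    return (recent_rate - baseline) / max(stdev, 0.1)


# 14 done in 7 days against a baseline of 1.0/day with stdev 0.5 gives z = 2.0,
# which the labels above describe as "well above your usual pace".
z_high = pace_z_score(14, 7, baseline=1.0, stdev=0.5)
# 7 done in 7 days matches the baseline exactly, so z = 0.0 and no label applies.
z_flat = pace_z_score(7, 7, baseline=1.0, stdev=0.5)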
 class MomentumAgent(BaseAgent):
    """Characterises the user's recent engagement trend from profile features."""
    agent_id: ClassVar[str] = MANIFEST.id
    ttl_seconds: ClassVar[int] = MANIFEST.ttl_sec
    version: ClassVar[str] = MANIFEST.version
    def compute(self, inp: AgentInput) -> AgentOutput:
        completion = inp.profile.get("completion_rate_30d")
        dismiss = inp.profile.get("dismiss_rate_30d")
        volume = inp.profile.get("tip_volume_30d")
        trend: str = inp.agent_prefs.get("engagement_trend", "stable")
        baseline: float = float(inp.agent_prefs.get("baseline_completions_per_day", 1.0))
        stdev: float = max(float(inp.agent_prefs.get("stdev", 1.0)), 0.1)
        window: int = int(inp.agent_prefs.get("momentum_window", 7))
        # Count done events in the recent window from feedback_history.
        now = inp.now.astimezone(timezone.utc)
        cutoff = now - timedelta(days=window)
        recent_done = sum(
            1 for e in inp.feedback_history
            if e.get("action") == "done" and _parse_dt(e.get("created_at", "")) >= cutoff
        )
        recent_rate = recent_done / window  # completions/day over the window
        z = (recent_rate - baseline) / stdev
        z_label = _z_score_label(z)
        parts: list[str] = []
        if completion is not None:
            pct = round(completion * 100)
            if pct >= 50:
                parts.append(f"The user completes {pct}% of tips (strong engagement).")
            elif pct >= 25:
                parts.append(f"The user completes {pct}% of tips (moderate engagement).")
            else:
                parts.append(
                    f"The user completes {pct}% of tips "
                    f"(low engagement — prefer simple, immediately actionable tips)."
                )
        else:
            parts.append("No completion-rate data yet (new user).")
        if dismiss is not None:
            dpct = round(dismiss * 100)
            if dpct >= 40:
                parts.append(f"Dismiss rate is high ({dpct}%) — avoid repetitive or irrelevant tips.")
            elif dpct <= 10:
                parts.append(f"Dismiss rate is low ({dpct}%).")
        if volume is not None and int(volume) < 5:
            parts.append("Very few tips served so far — this is an early-stage user.")
        # Z-score takes precedence over trend label when we have a baseline.
        if z_label:
            if z > 0:
                parts.append(
                    f"Completion pace is {z_label} "
                    f"({recent_done} done in the last {window}d vs "
                    f"~{baseline * window:.1f} expected) — build on the momentum."
                )
            else:
                parts.append(
                    f"Completion pace is {z_label} "
                    f"({recent_done} done in the last {window}d vs "
                    f"~{baseline * window:.1f} expected) — a motivational or easy-win tip may help."
                )
        elif trend == "up":
            parts.append("Engagement is trending up compared to last week — build on the momentum.")
        elif trend == "down":
            parts.append("Engagement is trending down — a motivational or easy-win tip may help.")
        prompt = " ".join(parts) if parts else "No engagement data available yet."
        snapshot = {
            "completion_rate_30d": completion,
            "dismiss_rate_30d": dismiss,
            "tip_volume_30d": volume,
            "engagement_trend": trend,
            "baseline_completions_per_day": baseline,
            "stdev": stdev,
            "momentum_window": window,
            "recent_done_count": recent_done,
            "z_score": round(z, 2),
        }
        return self._make_output(inp, prompt, snapshot)
--- a/ml/agents/overdue_task.py
+++ b/ml/agents/overdue_task.py
@@ -0,0 +1,165 @@
 from __future__ import annotations
 import statistics
 from typing import ClassVar
 from .base import BaseAgent, AgentInput, AgentOutput
 from .inference.history import UserHistory
 from .manifest import AgentManifest, InferredParam
 def _infer_lateness_tolerance(history: UserHistory) -> float:
    """p50 lateness (days) across completed tasks that had a due date, clipped at 0.
    Negative lateness (finished early) pulls the percentile down; we clip at 0
    so punctual users always get tolerance=0, never a negative offset.
    """
    lateness = [c.lateness_days for c in history.task_completions]
    if not lateness:
        return 0.0
    return max(0.0, statistics.median(lateness))
 def _infer_project_realness(history: UserHistory) -> dict[str, float]:
    """Per-project realness: 1 − (median project lateness / global median lateness).
    Projects whose tasks are consistently completed on time get realness ≈ 1.
    Aspirational projects (chronic lateness) get realness closer to 0.
    """
    completions = [c for c in history.task_completions if c.project_id]
    if not completions:
        return {}
    global_median = statistics.median(c.lateness_days for c in completions)
    if global_median <= 0:
        # Everyone finishes early — no project is less real than another.
        return {pid: 1.0 for pid in {c.project_id for c in completions}}  # type: ignore[misc]
    by_project: dict[str, list[float]] = {}
    for c in completions:
        by_project.setdefault(c.project_id, []).append(c.lateness_days)  # type: ignore[index]
    result: dict[str, float] = {}
    for pid, days in by_project.items():
        project_median = statistics.median(days)
        realness = 1.0 - (project_median / global_median)
        result[pid] = round(max(0.0, min(1.0, realness)), 3)
    return result
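# Illustrative sketch (not part of this module): the realness formula from
# _infer_project_realness on made-up lateness data. `realness_sketch` and the
# project names are hypothetical.
import statistics


def realness_sketch(lateness_by_project: dict[str, list[float]]) -> dict[str, float]:
    all_days = [d for days in lateness_by_project.values() for d in days]
    global_median = statistics.median(all_days)
    if global_median <= 0:
        return {pid: 1.0 for pid in lateness_by_project}
    return {
        pid: round(max(0.0, min(1.0, 1.0 - statistics.median(days) / global_median)), 3)
        for pid, days in lateness_by_project.items()
    }


# Global median lateness is 2.5 days. "work" has median 0 and scores 1.0;
# "someday" has median 6, so 1 - 6/2.5 is negative and clips to 0.0.
# Any project at or above the global median therefore lands on 0.
scores = realness_sketch({"work": [0.0, 0.0, 1.0], "someday": [4.0, 6.0, 8.0]})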
 MANIFEST = AgentManifest(
    id="overdue-task",
    version="1.2.0",  # #115: p50-lateness tolerance + per-project realness
    description="Reports the user's overdue tasks by count and age.",
    pref_schema={
        "type": "object",
        "additionalProperties": False,
        "properties": {
            "lateness_tolerance_days": {
                "type": "number",
                "minimum": 0,
                "default": 0,
                "description": "Days past due before a task is flagged. p50 of historical lateness.",
            },
            "project_realness": {
                "type": "object",
                "additionalProperties": {"type": "number", "minimum": 0, "maximum": 1},
                "default": {},
                "description": "Per-project realness score [0,1]. Low = aspirational due dates.",
            },
        },
    },
    context_schema=["todoist.tasks"],
    required_consents=["data:core", "data:todoist"],
    output_contract={"type": "snippet", "format": "free_text"},
    ttl_sec=3600,
    silenced_in_contexts=["vacation"],
    inferred_params=[
        InferredParam(
            key="lateness_tolerance_days",
            ttl_sec=7 * 86_400,      # recompute weekly — lateness habits shift slowly
            cold_start_default=0.0,
            min_history=10,
            infer=_infer_lateness_tolerance,
        ),
        InferredParam(
            key="project_realness",
            ttl_sec=7 * 86_400,
            cold_start_default={},
            min_history=10,
            infer=_infer_project_realness,
        ),
    ],
 )
 def _realness(project_id: str | None, project_realness: dict[str, float]) -> float:
    """Return realness for a project, defaulting to 1.0 (treat as real)."""
    if not project_id or not project_realness:
        return 1.0
    return project_realness.get(project_id, 1.0)
 def _format_task(task: dict, project_realness: dict[str, float]) -> str:
    content = task["content"]
    age = round(task.get("task_age_days", 0))
    pid = task.get("project_id")
    r = _realness(pid, project_realness)
    unit = "day" if age == 1 else "days"
    if r < 0.4:
        return f'"{content}" ({age} {unit} past target date)'
    return f'"{content}" ({age} {unit} overdue)'
 class OverdueTaskAgent(BaseAgent):
    """Reports the user's overdue tasks by count and age."""
    agent_id: ClassVar[str] = MANIFEST.id
    ttl_seconds: ClassVar[int] = MANIFEST.ttl_sec
    version: ClassVar[str] = MANIFEST.version
    def compute(self, inp: AgentInput) -> AgentOutput:
        tolerance = max(0.0, float(inp.agent_prefs.get("lateness_tolerance_days", 0)))
        project_realness: dict[str, float] = inp.agent_prefs.get("project_realness", {})
        overdue = [
            t for t in inp.tasks
            if t.get("is_overdue") and t.get("task_age_days", 0) >= tolerance
        ]
        top = sorted(overdue, key=lambda t: -t.get("task_age_days", 0))[:3]
        if not overdue:
            prompt = "The user has no overdue tasks at this time. (Always write the tip in English.)"
        elif len(overdue) == 1:
            t = top[0]
            r = _realness(t.get("project_id"), project_realness)
            item = _format_task(t, project_realness)
            if r < 0.4:
                prompt = f"The user has 1 task past its target date: {item}. (Task titles may be in any language — always write the tip in English.)"
            else:
                prompt = f"The user has 1 overdue task: {item}. (Task titles may be in any language — always write the tip in English.)"
        else:
            items = ", ".join(_format_task(t, project_realness) for t in top)
            avg_realness = (
                sum(_realness(t.get("project_id"), project_realness) for t in overdue)
                / len(overdue)
            )
            label = "tasks past their target dates" if avg_realness < 0.4 else "overdue tasks"
            prompt = (
                f"The user has {len(overdue)} {label}. "
                f"Top {len(top)}: {items}. (Task titles may be in any language — always write the tip in English.)"
            )
        snapshot = {
            "overdue_count": len(overdue),
            "lateness_tolerance_days": tolerance,
            "top_overdue": [
                {
                    "content": t["content"],
                    "task_age_days": t.get("task_age_days", 0),
                    "project_id": t.get("project_id"),
                    "realness": _realness(t.get("project_id"), project_realness),
                }
                for t in top
            ],
        }
        return self._make_output(inp, prompt, snapshot)
--- a/ml/agents/recent_patterns.py
+++ b/ml/agents/recent_patterns.py
@@ -0,0 +1,271 @@
 from __future__ import annotations
 import math
 from collections import Counter
 from datetime import datetime, timezone
 from typing import ClassVar
 from .base import BaseAgent, AgentInput, AgentOutput
 from .inference.history import UserHistory
 from .manifest import AgentManifest, InferredParam
 _DOW_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
 def _parse_dt(iso: str) -> datetime:
    try:
        dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt
    except ValueError:
        return datetime.min.replace(tzinfo=timezone.utc)
 def _infer_lookback_days(history: UserHistory) -> int:
    """Find the minimum window (days) that captures ≥30 done events, capped at 30.
    Sorts done events newest-first, then measures the span to the 30th event.
    If fewer than 30 done events exist, returns 30 (use the full cap).
    """
    done = sorted(
        [e for e in history.events if e.action == "done"],
        key=lambda e: e.created_at,
        reverse=True,
    )
    if len(done) < 30:
        return 30
    latest = _parse_dt(done[0].created_at)
    thirtieth = _parse_dt(done[29].created_at)
    span = (latest - thirtieth).total_seconds() / 86_400
    return max(1, min(30, math.ceil(span)))
 def _infer_weekly_cycle(history: UserHistory) -> list[dict]:
    """Peak-to-mean ratio of done events per day-of-week (0=Monday … 6=Sunday).
    Returns all 7 DOW entries so the caller can filter by strength threshold.
    """
    by_dow: Counter[int] = Counter(
        _parse_dt(e.created_at).weekday()
        for e in history.events
        if e.action == "done"
    )
    total = sum(by_dow.values())
    if total == 0:
        return []
    mean = total / 7
    return [
        {
            "dow": dow,
            "strength": round(by_dow.get(dow, 0) / mean, 3),
            "sample": f"completes most {_DOW_NAMES[dow]}s",
        }
        for dow in range(7)
    ]
 def _infer_daily_cycle(history: UserHistory) -> list[dict]:
    """Peak-to-mean ratio of done events per hour-of-day (0–23).
    Returns entries for hours that have at least one done event.
    """
    by_hour: Counter[int] = Counter(
        _parse_dt(e.created_at).hour
        for e in history.events
        if e.action == "done"
    )
    total = sum(by_hour.values())
    if total == 0:
        return []
    mean = total / 24
    return [
        {
            "hour": hour,
            "strength": round(by_hour[hour] / mean, 3),
        }
        for hour in sorted(by_hour)
    ]
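# Illustrative sketch (not part of this module): the per-weekday strength ratio
# computed by _infer_weekly_cycle, on made-up timestamps. `weekday_strengths`
# is a hypothetical helper; strength 1.0 means an exactly average weekday.
from collections import Counter
from datetime import datetime


def weekday_strengths(done_timestamps: list[str]) -> dict[int, float]:
    by_dow = Counter(datetime.fromisoformat(ts).weekday() for ts in done_timestamps)
    total = sum(by_dow.values())
    if total == 0:
        return {}
    mean = total / 7
    return {dow: round(by_dow.get(dow, 0) / mean, 3) for dow in range(7)}


# Five completions on a Monday and two on a Thursday: 7 events, mean 1 per weekday,
# so Monday (0) has strength 5.0, Thursday (3) has 2.0, and the rest have 0.0.
stamps = ["2026-05-04T09:00:00+00:00"] * 5 + ["2026-05-07T09:00:00+00:00"] * 2
strengths = weekday_strengths(stamps)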
 MANIFEST = AgentManifest(
    id="recent-patterns",
    version="1.2.0",  # #116: lookback_days + weekly_cycle + daily_cycle inference
    description="Surfaces the user's reaction pattern from recent feedback.",
    pref_schema={
        "type": "object",
        "additionalProperties": False,
        "properties": {
            "lookback_days": {
                "type": "integer",
                "minimum": 1,
                "maximum": 30,
                "default": 7,
                "description": "Lookback window sized to capture ≥30 done events.",
            },
            "weekly_cycle": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "dow": {"type": "integer"},
                        "strength": {"type": "number"},
                        "sample": {"type": "string"},
                    },
                },
                "default": [],
                "description": "Per-DOW completion strength (peak-to-mean ratio).",
            },
            "daily_cycle": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "hour": {"type": "integer"},
                        "strength": {"type": "number"},
                    },
                },
                "default": [],
                "description": "Per-hour completion strength (peak-to-mean ratio).",
            },
        },
    },
    context_schema=["tip_feedback", "profile.features"],
    required_consents=["data:core"],
    output_contract={"type": "snippet", "format": "free_text"},
    ttl_sec=86_400,
    inferred_params=[
        InferredParam(
            key="lookback_days",
            ttl_sec=86_400,
            cold_start_default=7,
            min_history=5,
            infer=_infer_lookback_days,
        ),
        InferredParam(
            key="weekly_cycle",
            ttl_sec=86_400,
            cold_start_default=[],
            min_history=21,           # need ≥3 weeks to see a weekly signal
            infer=_infer_weekly_cycle,
        ),
        InferredParam(
            key="daily_cycle",
            ttl_sec=86_400,
            cold_start_default=[],
            min_history=14,
            infer=_infer_daily_cycle,
        ),
    ],
 )
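 # Strength is a ratio to the mean (1.0 = an average day or hour), so a
 # threshold of 0.5 also admits below-average slots.
 _STRENGTH_THRESHOLD = 0.5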
 def _strong(entries: list[dict], key: str) -> list[dict]:
    return [e for e in entries if e.get(key, 0) > _STRENGTH_THRESHOLD]
 def _hour_label(hour: int) -> str:
    if hour == 0:
        return "midnight"
    if hour < 12:
        return f"{hour}am"
    if hour == 12:
        return "noon"
    return f"{hour - 12}pm"
 class RecentPatternsAgent(BaseAgent):
    """Surfaces the user's reaction pattern from recent feedback."""
    agent_id: ClassVar[str] = MANIFEST.id
    ttl_seconds: ClassVar[int] = MANIFEST.ttl_sec
    version: ClassVar[str] = MANIFEST.version
    def compute(self, inp: AgentInput) -> AgentOutput:
        # Support legacy window_days pref key for backward compat.
        lookback_days = max(
            1,
            int(inp.agent_prefs.get("lookback_days", inp.agent_prefs.get("window_days", 7))),
        )
        weekly_cycle: list[dict] = inp.agent_prefs.get("weekly_cycle", [])
        daily_cycle: list[dict] = inp.agent_prefs.get("daily_cycle", [])
        window_s = lookback_days * 86_400
        now_ts = inp.now.timestamp()
        recent = [
            f for f in inp.feedback_history
            if self._age_s(f.get("created_at", ""), now_ts) <= window_s
        ]
        counts: Counter[str] = Counter(f.get("action") for f in recent)
        total = len(recent)
        dwell_ms = inp.profile.get("mean_dwell_ms_30d")
        parts: list[str] = []
        if total == 0:
            parts.append(f"No tip reactions recorded in the last {lookback_days} days.")
        else:
            done = counts.get("done", 0)
            dismissed = counts.get("dismiss", 0)
            snoozed = counts.get("snooze", 0)
            parts.append(
                f"Last {lookback_days} days: {total} tip reaction{'s' if total != 1 else ''} — "
                f"{done} completed, {dismissed} dismissed, {snoozed} snoozed."
            )
            if dwell_ms is not None:
                dwell_s = round(dwell_ms / 1000)
                if dwell_s < 15:
                    parts.append(
                        "Average dwell is very short — user may be acting on auto-pilot; vary tip content."
                    )
                elif dwell_s < 60:
                    parts.append(f"Average dwell {dwell_s}s — tips are being read.")
                else:
                    parts.append(
                        f"Average dwell {dwell_s}s — user deliberates; prefer tips that reward reflection."
                    )
        # Cycle hints — only when strength > threshold.
        strong_weekly = _strong(weekly_cycle, "strength")
        if strong_weekly:
            # Pluralise each name so lists read "Mondays and Fridays", not "Monday and Fridays".
            day_names = [f"{_DOW_NAMES[e['dow']]}s" for e in strong_weekly]
            if len(day_names) == 1:
                parts.append(f"User tends to complete tips on {day_names[0]}.")
            else:
                joined = ", ".join(day_names[:-1]) + f" and {day_names[-1]}"
                parts.append(f"User tends to complete tips on {joined}.")
        strong_daily = _strong(daily_cycle, "strength")
        if strong_daily:
            hour_labels = [_hour_label(e["hour"]) for e in strong_daily]
            if len(hour_labels) == 1:
                parts.append(f"User is most active around {hour_labels[0]}.")
            else:
                joined = ", ".join(hour_labels[:-1]) + f" and {hour_labels[-1]}"
                parts.append(f"User is most active around {joined}.")
        prompt = " ".join(parts) if parts else "No engagement data available yet."
        snapshot = {
            "lookback_days": lookback_days,
            "recent_total": total,
            "action_counts": dict(counts),
            "mean_dwell_ms_30d": dwell_ms,
            "strong_weekly_days": [e["dow"] for e in strong_weekly],
            "strong_daily_hours": [e["hour"] for e in strong_daily],
        }
        return self._make_output(inp, prompt, snapshot)
    @staticmethod
    def _age_s(iso: str, now_ts: float) -> float:
        if not iso:
            return float("inf")
        try:
            dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=timezone.utc)
            return now_ts - dt.timestamp()
        except Exception:
            return float("inf")
--- a/ml/agents/registry.py
+++ b/ml/agents/registry.py
@@ -0,0 +1,64 @@
 """Agent registry — single point of registration for sub-agents (ADR-0014).
 Each agent module contributes:
  - a `BaseAgent` subclass instance
  - a module-level `MANIFEST: AgentManifest`
 The orchestrator, registry endpoint, and inference framework all read from
 here. Adding an agent is: add a module, register it once below.
 """
 from __future__ import annotations
 from .base import BaseAgent
 from .manifest import AgentManifest
 from .overdue_task import OverdueTaskAgent, MANIFEST as OVERDUE_TASK_MANIFEST
 from .momentum import MomentumAgent, MANIFEST as MOMENTUM_MANIFEST
 from .time_of_day import TimeOfDayAgent, MANIFEST as TIME_OF_DAY_MANIFEST
 from .recent_patterns import RecentPatternsAgent, MANIFEST as RECENT_PATTERNS_MANIFEST
 from .focus_area import FocusAreaAgent, MANIFEST as FOCUS_AREA_MANIFEST
 from .health_vitals import HealthVitalsAgent, MANIFEST as HEALTH_VITALS_MANIFEST
 from .tarot import TarotAgent, MANIFEST as TAROT_MANIFEST
 from .stars import StarsAgent, MANIFEST as STARS_MANIFEST
 _REGISTERED: list[tuple[BaseAgent, AgentManifest]] = [
    (OverdueTaskAgent(), OVERDUE_TASK_MANIFEST),
    (MomentumAgent(), MOMENTUM_MANIFEST),
    (TimeOfDayAgent(), TIME_OF_DAY_MANIFEST),
    (RecentPatternsAgent(), RECENT_PATTERNS_MANIFEST),
    (FocusAreaAgent(), FOCUS_AREA_MANIFEST),
    (HealthVitalsAgent(), HEALTH_VITALS_MANIFEST),
    (TarotAgent(), TAROT_MANIFEST),
    (StarsAgent(), STARS_MANIFEST),
 ]
 # Sanity check — agent_id and manifest.id must agree, otherwise the registry
 # becomes inconsistent across endpoints.
 for _agent, _manifest in _REGISTERED:
    if _agent.agent_id != _manifest.id:
        raise RuntimeError(
            f"Manifest mismatch: {_agent.__class__.__name__}.agent_id={_agent.agent_id!r} "
            f"≠ MANIFEST.id={_manifest.id!r}"
        )
 _AGENTS: dict[str, BaseAgent] = {a.agent_id: a for a, _ in _REGISTERED}
 _MANIFESTS: dict[str, AgentManifest] = {m.id: m for _, m in _REGISTERED}
 def get_agent(agent_id: str) -> BaseAgent:
    if agent_id not in _AGENTS:
        raise KeyError(f"Unknown agent: {agent_id!r}. Known: {sorted(_AGENTS)}")
    return _AGENTS[agent_id]
 def all_agents() -> list[BaseAgent]:
    return list(_AGENTS.values())
 def get_manifest(agent_id: str) -> AgentManifest:
    if agent_id not in _MANIFESTS:
        raise KeyError(f"Unknown agent: {agent_id!r}. Known: {sorted(_MANIFESTS)}")
    return _MANIFESTS[agent_id]
 def all_manifests() -> list[AgentManifest]:
    return list(_MANIFESTS.values())
--- a/ml/agents/stars.py
+++ b/ml/agents/stars.py
@@ -0,0 +1,233 @@
 """Stars agent — astrological transit predictions via pyswisseph.
 Requires birth_date in agent_prefs (ISO 8601 date string, e.g. '1990-06-15').
 Populated from a connected data source (Google profile / Google Health).
 If birth_date is absent the agent returns a no-data snippet and the
 eligibility filter will silence it once the consent / pref check catches up.
 Computes today's Sun, Moon, Mercury, Venus, Mars, Jupiter, Saturn positions
 and finds notable transits (conjunctions, oppositions, squares, trines, sextiles)
 between today's sky and the user's natal chart. Passes a concise prediction
 + interpretation to the orchestrator.
 """
 from __future__ import annotations
 import math
 from datetime import date, datetime, timezone
 from typing import ClassVar
 from .base import BaseAgent, AgentInput, AgentOutput
 from .manifest import AgentManifest, InferredParam
 try:
    import swisseph as swe  # type: ignore
    _SWE_AVAILABLE = True
 except ImportError:  # pragma: no cover — present in container, absent in dev
    _SWE_AVAILABLE = False
 # ---------------------------------------------------------------------------
 # Planet catalogue
 # ---------------------------------------------------------------------------
 _PLANETS: list[tuple[int, str]] = []
 if _SWE_AVAILABLE:
    _PLANETS = [
        (swe.SUN,     "Sun"),
        (swe.MOON,    "Moon"),
        (swe.MERCURY, "Mercury"),
        (swe.VENUS,   "Venus"),
        (swe.MARS,    "Mars"),
        (swe.JUPITER, "Jupiter"),
        (swe.SATURN,  "Saturn"),
    ]
 # Aspect definitions: (angle, orb, name, nature)
 _ASPECTS: list[tuple[float, float, str, str]] = [
    (0.0,   8.0, "conjunction",  "intensifying"),
    (60.0,  6.0, "sextile",      "harmonious"),
    (90.0,  7.0, "square",       "challenging"),
    (120.0, 8.0, "trine",        "flowing"),
    (180.0, 8.0, "opposition",   "tension"),
 ]
 _ZODIAC = [
    "Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
    "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces",
 ]
 # Interpretive keywords per planet for transit readings
 _PLANET_THEMES: dict[str, str] = {
    "Sun":     "identity, vitality, core purpose",
    "Moon":    "emotions, intuition, comfort needs",
    "Mercury": "communication, thinking, decisions",
    "Venus":   "relationships, values, pleasure",
    "Mars":    "energy, drive, conflict",
    "Jupiter": "growth, opportunity, expansion",
    "Saturn":  "discipline, responsibility, long-term structure",
 }
 def _zodiac_sign(lon: float) -> str:
    return _ZODIAC[int(lon / 30) % 12]
 def _jd_from_date(d: date) -> float:
    """Julian Day Number for noon UTC on the given date."""
    assert _SWE_AVAILABLE
    return swe.julday(d.year, d.month, d.day, 12.0)
 def _planet_positions(jd: float) -> dict[str, float]:
    assert _SWE_AVAILABLE
    positions: dict[str, float] = {}
    for pid, name in _PLANETS:
        result, _ = swe.calc_ut(jd, pid)
        positions[name] = result[0]  # ecliptic longitude
    return positions
 def _angular_diff(a: float, b: float) -> float:
    """Smallest angle between two ecliptic longitudes (0–180)."""
    diff = abs(a - b) % 360
    return diff if diff <= 180 else 360 - diff
 def _find_transits(natal: dict[str, float], today: dict[str, float]) -> list[dict]:
    """Return list of active transits between today's sky and natal chart."""
    transits: list[dict] = []
    for t_name, t_lon in today.items():
        for n_name, n_lon in natal.items():
            diff = _angular_diff(t_lon, n_lon)
            for angle, orb, aspect_name, nature in _ASPECTS:
                if abs(diff - angle) <= orb:
                    transits.append({
                        "transit_planet": t_name,
                        "natal_planet": n_name,
                        "aspect": aspect_name,
                        "nature": nature,
                        "orb": round(abs(diff - angle), 2),
                    })
    # Sort by tightness of orb
    transits.sort(key=lambda x: x["orb"])
    return transits
 def _format_transit(t: dict) -> str:
    tp, np, asp, nat = t["transit_planet"], t["natal_planet"], t["aspect"], t["nature"]
    tp_theme = _PLANET_THEMES.get(tp, "")
    np_theme = _PLANET_THEMES.get(np, "")
    return (
        f"Transiting {tp} ({tp_theme}) {asp} natal {np} ({np_theme}) "
        f"— a {nat} influence"
    )
 # ---------------------------------------------------------------------------
 # Manifest
 # ---------------------------------------------------------------------------
 MANIFEST = AgentManifest(
    id="stars",
    version="1.0.0",
    description="Astrological transit predictions based on the user's birth date and today's planetary positions.",
    pref_schema={
        "type": "object",
        "additionalProperties": False,
        "properties": {
            "birth_date": {
                "type": "string",
                "pattern": r"^\d{4}-\d{2}-\d{2}$",
                "description": "ISO 8601 birth date (YYYY-MM-DD). Populated from connected data source.",
            },
        },
    },
    context_schema=["profile.birth_date"],
    # Requires a connected Google source that supplies birth date.
    # data:google-health is the current carrier; when Google profile is a
    # separate consent key, add it here.
    required_consents=["data:core", "data:google-health"],
    output_contract={"type": "snippet", "format": "free_text"},
    ttl_sec=3_600 * 6,  # planetary positions change slowly — 6 h is fine
    silenced_in_contexts=[],
    inferred_params=[
        InferredParam(
            key="birth_date",
            ttl_sec=365 * 86_400,   # effectively permanent once known
            cold_start_default=None,
            min_history=999_999,    # never inferred from events — sourced externally
            infer=None,
        ),
    ],
 )
 class StarsAgent(BaseAgent):
    """Produces astrological transit predictions for the user's birth chart."""
    agent_id: ClassVar[str] = MANIFEST.id
    ttl_seconds: ClassVar[int] = MANIFEST.ttl_sec
    version: ClassVar[str] = MANIFEST.version
    def compute(self, inp: AgentInput) -> AgentOutput:
        birth_date_str: str | None = inp.agent_prefs.get("birth_date")
        if not birth_date_str:
            prompt = (
                "Birth date is not available — astrological reading skipped. "
                "(Always write the tip in English.)"
            )
            return self._make_output(inp, prompt, {"no_birth_date": True})
        if not _SWE_AVAILABLE:
            prompt = (
                "Astrological library unavailable — reading skipped. "
                "(Always write the tip in English.)"
            )
            return self._make_output(inp, prompt, {"swe_unavailable": True})
        try:
            birth_date = date.fromisoformat(birth_date_str)
        except ValueError:
            prompt = "Birth date format invalid — astrological reading skipped."
            return self._make_output(inp, prompt, {"invalid_birth_date": birth_date_str})
        today_date = inp.now.date()
        natal_jd = _jd_from_date(birth_date)
        today_jd = _jd_from_date(today_date)
        natal_pos = _planet_positions(natal_jd)
        today_pos = _planet_positions(today_jd)
        transits = _find_transits(natal_pos, today_pos)
        top = transits[:3]  # most exact transits only
        today_sun_sign = _zodiac_sign(today_pos["Sun"])
        natal_sun_sign = _zodiac_sign(natal_pos["Sun"])
        natal_moon_sign = _zodiac_sign(natal_pos["Moon"])
        snapshot = {
            "birth_date": birth_date_str,
            "today": today_date.isoformat(),
            "natal_sun": natal_sun_sign,
            "natal_moon": natal_moon_sign,
            "today_sun": today_sun_sign,
            "active_transits": transits[:5],
        }
        if not top:
            prompt = (
                f"Natal chart: Sun in {natal_sun_sign}, Moon in {natal_moon_sign}. "
                f"Today's Sun is in {today_sun_sign}. "
                "No exact transits today — a quiet, stable day energetically. "
                "(Always write the tip in English.)"
            )
        else:
            transit_lines = "; ".join(_format_transit(t) for t in top)
            prompt = (
                f"Natal chart: Sun in {natal_sun_sign}, Moon in {natal_moon_sign}. "
                f"Today's Sun is in {today_sun_sign}. "
                f"Active transits: {transit_lines}. "
                "Use these planetary themes to colour the tip — "
                "keep it grounded and actionable, not predictive or fatalistic. "
                "(Always write the tip in English.)"
            )
        return self._make_output(inp, prompt, snapshot)
--- a/ml/agents/tarot.py
+++ b/ml/agents/tarot.py
@@ -0,0 +1,110 @@
 """TAROT agent — three-card draw (situation / action / outcome).
 Draws cards deterministically from a daily seed so the reading stays
 stable for the day (same cards whether the agent runs at 08:00 or 14:00).
 Card meanings are precomputed here and passed as a structured snippet to
 the orchestrator, which weaves them into a grounded, actionable tip.
 """
 from __future__ import annotations
 import hashlib
 from typing import ClassVar
 from .base import BaseAgent, AgentInput, AgentOutput
 from .manifest import AgentManifest
 # ---------------------------------------------------------------------------
 # Card definitions — Major Arcana only (22 cards, indices 0–21)
 # Each entry: (name, upright_meaning, action_hint)
 # ---------------------------------------------------------------------------
 _CARDS: list[tuple[str, str, str]] = [
    ("The Fool",         "new beginnings, spontaneity, a leap of faith",          "start something without overthinking"),
    ("The Magician",     "skill, willpower, resourcefulness",                      "use what you already have"),
    ("The High Priestess","intuition, inner knowing, patience",                    "listen to what you already sense is true"),
    ("The Empress",      "abundance, creativity, nurturing",                       "invest energy in something generative"),
    ("The Emperor",      "structure, authority, discipline",                       "set a boundary or impose order"),
    ("The Hierophant",   "tradition, guidance, shared values",                     "seek or offer mentorship"),
    ("The Lovers",       "alignment, choice, commitment",                          "make a decision you have been avoiding"),
    ("The Chariot",      "determination, focus, forward motion",                   "push through the resistance"),
    ("Strength",         "inner courage, patience, gentle persistence",            "stay the course with compassion"),
    ("The Hermit",       "solitude, reflection, inner guidance",                   "step back and think before acting"),
    ("Wheel of Fortune", "cycles, turning points, inevitable change",              "acknowledge what is shifting around you"),
    ("Justice",          "fairness, truth, cause and effect",                      "audit a recent decision for its real consequences"),
    ("The Hanged Man",   "pause, surrender, new perspective",                      "release your grip on the outcome"),
    ("Death",            "endings, transformation, release",                       "let go of what no longer serves you"),
    ("Temperance",       "balance, moderation, patience",                          "blend two competing demands"),
    ("The Devil",        "attachment, habit, shadow patterns",                     "name a loop you are stuck in"),
    ("The Tower",        "sudden disruption, revelation, necessary collapse",      "accept the thing that already broke"),
    ("The Star",         "hope, renewal, calm after the storm",                    "trust that recovery is already underway"),
    ("The Moon",         "uncertainty, illusion, the unconscious",                 "sit with ambiguity rather than forcing clarity"),
    ("The Sun",          "clarity, vitality, success",                             "act from your most energised self"),
    ("Judgement",        "reflection, reckoning, a call to rise",                  "respond to a long-deferred summons"),
    ("The World",        "completion, integration, a cycle closing",               "acknowledge what you have finished"),
 ]
 _POSITIONS = ("situation", "action", "outcome")
 def _daily_draw(user_id: str, date_str: str) -> list[int]:
    """Return three distinct card indices seeded by (user_id, date)."""
    seed = hashlib.sha256(f"{user_id}:{date_str}".encode()).digest()
    indices: list[int] = []
    offset = 0
    while len(indices) < 3:
        val = int.from_bytes(seed[offset:offset + 2], "big") % len(_CARDS)
        if val not in indices:
            indices.append(val)
        # Wrap modulo 31 (not 32): after offset 30 the scan continues at odd offsets 1, 3, ..., 29.
        offset = (offset + 2) % (len(seed) - 1)
    return indices
 MANIFEST = AgentManifest(
    id="tarot",
    version="1.0.0",
    description="Daily three-card draw (situation/action/outcome) that frames the tip as a symbolic reflection.",
    pref_schema={
        "type": "object",
        "additionalProperties": False,
        "properties": {
            "enabled": {
                "type": "boolean",
                "default": True,
                "description": "Set false to disable the tarot agent for this user.",
            },
        },
    },
    context_schema=[],
    required_consents=["data:core"],
    output_contract={"type": "snippet", "format": "free_text"},
    ttl_sec=3_600 * 6,   # stable for 6 h; refreshes mid-day at most twice
    silenced_in_contexts=[],
    inferred_params=[],
 )
 class TarotAgent(BaseAgent):
    """Produces a three-card reading as a prompt snippet."""
    agent_id: ClassVar[str] = MANIFEST.id
    ttl_seconds: ClassVar[int] = MANIFEST.ttl_sec
    version: ClassVar[str] = MANIFEST.version
    def compute(self, inp: AgentInput) -> AgentOutput:
        date_str = inp.now.strftime("%Y-%m-%d")
        indices = _daily_draw(inp.user_id, date_str)
        reading: list[dict] = []
        parts: list[str] = [f"Today's tarot reading ({date_str}):"]
        for pos, idx in zip(_POSITIONS, indices):
            name, meaning, hint = _CARDS[idx]
            reading.append({"position": pos, "card": name, "meaning": meaning, "hint": hint})
            parts.append(f"  {pos.capitalize()} — {name}: {meaning}. Hint: {hint}.")
        parts.append(
            "Weave these symbolic themes lightly into the tip — "
            "ground them in practical, specific action. "
            "Do not explain the cards; let their meaning shape the advice."
        )
        prompt = "\n".join(parts)
        snapshot = {"date": date_str, "reading": reading}
        return self._make_output(inp, prompt, snapshot)
--- a/ml/agents/tests/__init__.py
+++ b/ml/agents/tests/__init__.py
--- a/ml/agents/tests/test_agents.py
+++ b/ml/agents/tests/test_agents.py
@@ -0,0 +1,370 @@
 """Unit tests for all sub-agents and the registry."""
 from __future__ import annotations
 import sys, os
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", ".."))
 from datetime import datetime, timezone
 import pytest
 from ml.agents.base import AgentInput, AgentOutput
 from ml.agents.overdue_task import OverdueTaskAgent
 from ml.agents.momentum import MomentumAgent
 from ml.agents.time_of_day import TimeOfDayAgent
 from ml.agents.recent_patterns import RecentPatternsAgent
 from ml.agents.focus_area import FocusAreaAgent
 from ml.agents.tarot import TarotAgent, _daily_draw, _CARDS, _POSITIONS
 from ml.agents.stars import StarsAgent, _SWE_AVAILABLE
 from ml.agents.registry import get_agent, all_agents
 _NOW = datetime(2026, 5, 1, 9, 0, 0, tzinfo=timezone.utc)  # Friday 09:00 UTC
 def _inp(**kwargs) -> AgentInput:
    defaults = dict(
        user_id="u1",
        tasks=[],
        profile={},
        feedback_history=[],
        now=_NOW,
    )
    defaults.update(kwargs)
    return AgentInput(**defaults)
 def _task(content="Do thing", is_overdue=False, task_age_days=0.0, priority=1, project_id=None):
    t = {"id": "t1", "content": content, "is_overdue": is_overdue,
         "task_age_days": task_age_days, "priority": priority}
    if project_id:
        t["project_id"] = project_id
    return t
 # ── helpers ──────────────────────────────────────────────────────────────────
 def _check_output(out: AgentOutput, agent) -> None:
    assert isinstance(out, AgentOutput)
    assert out.user_id == "u1"
    assert out.agent_id == agent.agent_id
    assert out.prompt_text
    assert out.computed_at
    assert out.expires_at > out.computed_at
    assert out.agent_version == agent.version
 # ── OverdueTaskAgent ──────────────────────────────────────────────────────────
 class TestOverdueTaskAgent:
    agent = OverdueTaskAgent()
    def test_no_overdue(self):
        out = self.agent.compute(_inp(tasks=[_task("Read book")]))
        _check_output(out, self.agent)
        assert "no overdue" in out.prompt_text.lower()
        assert out.signals_snapshot["overdue_count"] == 0
    def test_single_overdue(self):
        out = self.agent.compute(_inp(tasks=[_task("Call dentist", is_overdue=True, task_age_days=3)]))
        _check_output(out, self.agent)
        assert "1 overdue" in out.prompt_text
        assert "Call dentist" in out.prompt_text
        assert "3 day" in out.prompt_text
    def test_multiple_overdue_top3(self):
        tasks = [
            _task(f"Task {i}", is_overdue=True, task_age_days=float(i))
            for i in range(1, 6)
        ]
        out = self.agent.compute(_inp(tasks=tasks))
        _check_output(out, self.agent)
        assert "5 overdue" in out.prompt_text
        assert out.signals_snapshot["overdue_count"] == 5
        assert len(out.signals_snapshot["top_overdue"]) == 3
        # Top 3 should be highest age: 5, 4, 3
        ages = [t["task_age_days"] for t in out.signals_snapshot["top_overdue"]]
        assert ages == sorted(ages, reverse=True)
    def test_ttl_respected(self):
        out = self.agent.compute(_inp())
        assert out.expires_at > out.computed_at
 # ── MomentumAgent ─────────────────────────────────────────────────────────────
 class TestMomentumAgent:
    agent = MomentumAgent()
    def test_no_profile(self):
        out = self.agent.compute(_inp(profile={}))
        _check_output(out, self.agent)
        assert "new user" in out.prompt_text.lower() or "no " in out.prompt_text.lower()
    def test_strong_engagement(self):
        out = self.agent.compute(_inp(profile={"completion_rate_30d": 0.65, "dismiss_rate_30d": 0.05}))
        assert "strong engagement" in out.prompt_text
    def test_low_completion_warns(self):
        out = self.agent.compute(_inp(profile={"completion_rate_30d": 0.1}))
        assert "low engagement" in out.prompt_text
        assert "actionable" in out.prompt_text
    def test_high_dismiss_warns(self):
        out = self.agent.compute(_inp(profile={"completion_rate_30d": 0.3, "dismiss_rate_30d": 0.5}))
        assert "dismiss rate is high" in out.prompt_text.lower()
    def test_early_stage_user(self):
        out = self.agent.compute(_inp(profile={"tip_volume_30d": 2.0}))
        assert "early-stage" in out.prompt_text
 # ── TimeOfDayAgent ────────────────────────────────────────────────────────────
 class TestTimeOfDayAgent:
    agent = TimeOfDayAgent()
    def test_morning_label(self):
        inp = _inp(now=datetime(2026, 5, 1, 8, 0, tzinfo=timezone.utc))  # Friday
        out = self.agent.compute(inp)
        assert "morning" in out.prompt_text
        assert "08:00" in out.prompt_text
    def test_weekend_note(self):
        inp = _inp(now=datetime(2026, 5, 2, 10, 0, tzinfo=timezone.utc))  # Saturday
        out = self.agent.compute(inp)
        assert "weekend" in out.prompt_text.lower()
    def test_peak_hour_exact(self):
        inp = _inp(
            now=datetime(2026, 5, 1, 10, 0, tzinfo=timezone.utc),
            profile={"preferred_hour": 10.0},
        )
        out = self.agent.compute(inp)
        assert "peak productivity hour" in out.prompt_text
    def test_approaching_peak(self):
        inp = _inp(
            now=datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc),
            profile={"preferred_hour": 10.0},
        )
        out = self.agent.compute(inp)
        assert "approaching" in out.prompt_text.lower()
    def test_no_preferred_hour(self):
        out = self.agent.compute(_inp())
        assert "no preferred-hour" in out.prompt_text.lower()
    def test_snapshot_keys(self):
        out = self.agent.compute(_inp())
        assert {"hour", "day_of_week", "preferred_hour", "quiet_start", "quiet_end",
                "peak_hours", "in_quiet", "in_peak", "tz"} == set(out.signals_snapshot)
 # ── RecentPatternsAgent ───────────────────────────────────────────────────────
 class TestRecentPatternsAgent:
    agent = RecentPatternsAgent()
    def test_no_feedback(self):
        out = self.agent.compute(_inp())
        assert "no tip reactions" in out.prompt_text.lower()
    def test_recent_feedback_summary(self):
        now_iso = _NOW.isoformat()
        feedback = [
            {"action": "done", "dwell_ms": 30000, "created_at": now_iso},
            {"action": "done", "dwell_ms": 45000, "created_at": now_iso},
            {"action": "dismiss", "dwell_ms": 2000, "created_at": now_iso},
        ]
        out = self.agent.compute(_inp(feedback_history=feedback))
        assert "3 tip reactions" in out.prompt_text
        assert "2 completed" in out.prompt_text
        assert "1 dismissed" in out.prompt_text
    def test_old_feedback_excluded(self):
        # 10 days ago — should be excluded from 7-day window
        old_iso = "2026-04-21T09:00:00+00:00"
        feedback = [{"action": "done", "dwell_ms": 5000, "created_at": old_iso}]
        out = self.agent.compute(_inp(feedback_history=feedback))
        assert "no tip reactions" in out.prompt_text.lower()
    def test_short_dwell_note(self):
        now_iso = _NOW.isoformat()
        feedback = [{"action": "done", "dwell_ms": 5000, "created_at": now_iso}]
        out = self.agent.compute(_inp(
            feedback_history=feedback,
            profile={"mean_dwell_ms_30d": 5000.0},
        ))
        assert "auto-pilot" in out.prompt_text.lower() or "short" in out.prompt_text.lower()
    def test_long_dwell_note(self):
        now_iso = _NOW.isoformat()
        feedback = [{"action": "done", "dwell_ms": 90000, "created_at": now_iso}]
        out = self.agent.compute(_inp(
            feedback_history=feedback,
            profile={"mean_dwell_ms_30d": 90000.0},
        ))
        assert "deliberate" in out.prompt_text.lower() or "reflection" in out.prompt_text.lower()
 # ── FocusAreaAgent ────────────────────────────────────────────────────────────
 class TestFocusAreaAgent:
    agent = FocusAreaAgent()
    def test_no_tasks(self):
        out = self.agent.compute(_inp())
        assert "no tasks" in out.prompt_text.lower()
    def test_lists_all_clusters(self):
        tasks = (
            [_task(f"W{i}", project_id="Work") for i in range(3)]
            + [_task(f"H{i}", project_id="Home") for i in range(2)]
        )
        out = self.agent.compute(_inp(tasks=tasks))
        assert "Work" in out.prompt_text
        assert "Home" in out.prompt_text
    def test_includes_task_titles(self):
        tasks = [_task("Buy milk", project_id="Personal"), _task("Write report", project_id="Personal")]
        out = self.agent.compute(_inp(tasks=tasks))
        assert '"Buy milk"' in out.prompt_text
        assert '"Write report"' in out.prompt_text
    def test_task_count_in_output(self):
        tasks = [_task(f"T{i}", project_id="Work") for i in range(3)]
        out = self.agent.compute(_inp(tasks=tasks))
        assert "3 task" in out.prompt_text
    def test_default_project_fallback(self):
        out = self.agent.compute(_inp(tasks=[_task("No project task")]))
        assert "Tasks" in out.prompt_text
    def test_snapshot_keys(self):
        out = self.agent.compute(_inp(tasks=[_task("T1", project_id="A")]))
        public_keys = {k for k in out.signals_snapshot if not k.startswith("_")}
        assert {"cluster_count", "clusters"} == public_keys
    def test_snapshot_clusters_shape(self):
        tasks = [_task("Buy milk", project_id="P1"), _task("Fix bug", project_id="P2")]
        out = self.agent.compute(_inp(tasks=tasks))
        clusters = out.signals_snapshot["clusters"]
        assert isinstance(clusters, list)
        assert all("label" in c and "task_count" in c and "tasks" in c for c in clusters)
 # ── TarotAgent ────────────────────────────────────────────────────────────────
 class TestTarotAgent:
    agent = TarotAgent()
    def test_basic_output(self):
        out = self.agent.compute(_inp())
        _check_output(out, self.agent)
        assert "situation" in out.prompt_text.lower()
        assert "action" in out.prompt_text.lower()
        assert "outcome" in out.prompt_text.lower()
        assert out.signals_snapshot["date"] == "2026-05-01"
        assert len(out.signals_snapshot["reading"]) == 3
    def test_three_distinct_cards(self):
        out = self.agent.compute(_inp())
        cards = [r["card"] for r in out.signals_snapshot["reading"]]
        assert len(set(cards)) == 3
    def test_positions_labelled(self):
        out = self.agent.compute(_inp())
        positions = [r["position"] for r in out.signals_snapshot["reading"]]
        assert positions == list(_POSITIONS)
    def test_daily_stability(self):
        out1 = self.agent.compute(_inp(now=datetime(2026, 5, 1, 8, 0, 0, tzinfo=timezone.utc)))
        out2 = self.agent.compute(_inp(now=datetime(2026, 5, 1, 20, 0, 0, tzinfo=timezone.utc)))
        assert out1.signals_snapshot["reading"] == out2.signals_snapshot["reading"]
    def test_different_days_different_draw(self):
        out1 = self.agent.compute(_inp(now=datetime(2026, 5, 1, 9, 0, 0, tzinfo=timezone.utc)))
        out2 = self.agent.compute(_inp(now=datetime(2026, 5, 2, 9, 0, 0, tzinfo=timezone.utc)))
        assert out1.signals_snapshot["reading"] != out2.signals_snapshot["reading"]
    def test_different_users_different_draw(self):
        out1 = self.agent.compute(_inp(user_id="user-A"))
        out2 = self.agent.compute(_inp(user_id="user-B"))
        assert out1.signals_snapshot["reading"] != out2.signals_snapshot["reading"]
    def test_daily_draw_returns_valid_indices(self):
        indices = _daily_draw("u1", "2026-05-01")
        assert len(indices) == 3
        assert len(set(indices)) == 3
        assert all(0 <= i < len(_CARDS) for i in indices)
 # ── StarsAgent ────────────────────────────────────────────────────────────────
 class TestStarsAgent:
    agent = StarsAgent()
    def test_no_birth_date(self):
        out = self.agent.compute(_inp())
        _check_output(out, self.agent)
        assert out.signals_snapshot.get("no_birth_date") is True
        assert "birth date" in out.prompt_text.lower()
    @pytest.mark.skipif(not _SWE_AVAILABLE, reason="pyswisseph not installed")
    def test_invalid_birth_date(self):
        out = self.agent.compute(_inp(agent_prefs={"birth_date": "not-a-date"}))
        _check_output(out, self.agent)
        assert out.signals_snapshot.get("invalid_birth_date") == "not-a-date"
    @pytest.mark.skipif(not _SWE_AVAILABLE, reason="pyswisseph not installed")
    def test_with_birth_date(self):
        out = self.agent.compute(_inp(agent_prefs={"birth_date": "1990-06-15"}))
        _check_output(out, self.agent)
        assert "natal" in out.prompt_text.lower()
        assert out.signals_snapshot["birth_date"] == "1990-06-15"
        assert "natal_sun" in out.signals_snapshot
        assert "natal_moon" in out.signals_snapshot
    @pytest.mark.skipif(not _SWE_AVAILABLE, reason="pyswisseph not installed")
    def test_transit_snapshot_structure(self):
        out = self.agent.compute(_inp(agent_prefs={"birth_date": "1985-03-21"}))
        snap = out.signals_snapshot
        assert "active_transits" in snap
        for t in snap["active_transits"]:
            assert {"transit_planet", "natal_planet", "aspect", "nature", "orb"} <= t.keys()
    def test_swe_unavailable_path(self, monkeypatch):
        import ml.agents.stars as stars_mod
        monkeypatch.setattr(stars_mod, "_SWE_AVAILABLE", False)
        agent = StarsAgent()
        out = agent.compute(_inp(agent_prefs={"birth_date": "1990-06-15"}))
        _check_output(out, agent)
        assert out.signals_snapshot.get("swe_unavailable") is True
 # ── Registry ─────────────────────────────────────────────────────────────────
 class TestRegistry:
    def test_all_agents_present(self):
        agents = all_agents()
        ids = {a.agent_id for a in agents}
        assert ids == {"overdue-task", "momentum", "time-of-day", "recent-patterns", "focus-area", "health-vitals", "tarot", "stars"}
    def test_get_agent(self):
        a = get_agent("momentum")
        assert a.agent_id == "momentum"
    def test_get_unknown_raises(self):
        with pytest.raises(KeyError, match="Unknown agent"):
            get_agent("nonexistent")
    def test_all_agents_compute(self):
        inp = _inp(
            tasks=[_task("Buy milk", is_overdue=True, task_age_days=2, project_id="Personal")],
            profile={"completion_rate_30d": 0.4, "tip_volume_30d": 10.0, "preferred_hour": 9.0},
            feedback_history=[
                {"action": "done", "dwell_ms": 25000, "created_at": _NOW.isoformat()}
            ],
        )
        for agent in all_agents():
            out = agent.compute(inp)
            _check_output(out, agent)
--- a/ml/agents/tests/test_clustering.py
+++ b/ml/agents/tests/test_clustering.py
@@ -0,0 +1,209 @@
 """Unit tests for ml.agents.clustering (issue #97, #129).
 LLM and embedding calls are mocked so tests run without Ollama or LiteLLM.
 """
 from __future__ import annotations
 import sys, os
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", ".."))
 from unittest.mock import patch
 from ml.agents.clustering import cluster_tasks, Cluster, _greedy_cluster, _cosine, _embed_batch, _enrich_batch
 # ── helpers ──────────────────────────────────────────────────────────────────
 def _task(content: str, project_id: str | None = None, is_overdue: bool = False) -> dict:
    t: dict = {"content": content, "is_overdue": is_overdue}
    if project_id:
        t["project_id"] = project_id
    return t
 def _embed_seq(*vecs):
    """Return a side_effect list so successive _embed calls return these vectors."""
    return list(vecs)
 # ── Cluster dataclass ─────────────────────────────────────────────────────────
 class TestCluster:
    def test_task_count(self):
        c = Cluster(label="X", tasks=[_task("a"), _task("b")])
        assert c.task_count == 2
    def test_overdue_count(self):
        c = Cluster(label="X", tasks=[_task("a", is_overdue=True), _task("b")])
        assert c.overdue_count == 1
 # ── cosine similarity ─────────────────────────────────────────────────────────
 class TestCosine:
    def test_identical_vectors(self):
        v = [1.0, 0.0, 0.0]
        assert _cosine(v, v) == 1.0
    def test_orthogonal_vectors(self):
        assert _cosine([1.0, 0.0], [0.0, 1.0]) == 0.0
    def test_zero_vector(self):
        assert _cosine([0.0, 0.0], [1.0, 0.0]) == 0.0
 # ── greedy clustering ─────────────────────────────────────────────────────────
 class TestGreedyClustering:
    def _similar_vec(self, base: list[float], noise: float = 0.01) -> list[float]:
        return [x + noise for x in base]
    def test_similar_tasks_grouped(self):
        v = [1.0, 0.0, 0.0]
        v2 = [0.999, 0.001, 0.0]
        items = [
            (_task("A"), v),
            (_task("B"), v2),
        ]
        clusters = _greedy_cluster(items)
        assert len(clusters) == 1
        assert clusters[0].task_count == 2
    def test_dissimilar_tasks_separate(self):
        v1 = [1.0, 0.0, 0.0]
        v2 = [0.0, 1.0, 0.0]
        items = [(_task("A"), v1), (_task("B"), v2)]
        clusters = _greedy_cluster(items)
        assert len(clusters) == 2
    def test_label_from_first_task(self):
        v = [1.0, 0.0]
        clusters = _greedy_cluster([(_task("Write report"), v)])
        assert clusters[0].label == "Write report"
 # ── enrichment ───────────────────────────────────────────────────────────────
 class TestEnrichBatch:
    def test_falls_back_to_raw_when_no_litellm_url(self, monkeypatch):
        monkeypatch.delenv("LITELLM_URL", raising=False)
        result, new = _enrich_batch(["Buy milk", "Fix bug"])
        assert result == ["Buy milk", "Fix bug"] and new == {}
    def test_uses_description_when_litellm_available(self, monkeypatch):
        monkeypatch.setenv("LITELLM_URL", "http://fake-litellm")
        with patch("ml.agents.clustering._enrich_title", return_value="Expanded description."):
            result, new = _enrich_batch(["Buy milk"])
        assert result == ["Expanded description."]
        assert len(new) == 1
    def test_falls_back_to_raw_title_on_enrich_failure(self, monkeypatch):
        monkeypatch.setenv("LITELLM_URL", "http://fake-litellm")
        with patch("ml.agents.clustering._enrich_title", return_value=None):
            result, new = _enrich_batch(["Buy milk"])
        assert result == ["Buy milk"]
        assert new == {}  # failed enrichments are not persisted
    def test_deduplicates_identical_titles(self, monkeypatch):
        monkeypatch.setenv("LITELLM_URL", "http://fake-litellm")
        call_count = {"n": 0}
        def fake_enrich(title, url):
            call_count["n"] += 1
            return f"desc:{title}"
        with patch("ml.agents.clustering._enrich_title", side_effect=fake_enrich):
            result, new = _enrich_batch(["Buy milk", "Buy milk", "Fix bug"])
        assert call_count["n"] == 2  # only 2 unique titles
        assert result == ["desc:Buy milk", "desc:Buy milk", "desc:Fix bug"]
    def test_uses_persistent_cache(self, monkeypatch):
        monkeypatch.setenv("LITELLM_URL", "http://fake-litellm")
        from ml.agents.clustering import _content_hash
        h = _content_hash("Buy milk")
        call_count = {"n": 0}
        def fake_enrich(title, url):
            call_count["n"] += 1
            return "new desc"
        with patch("ml.agents.clustering._enrich_title", side_effect=fake_enrich):
            result, new = _enrich_batch(["Buy milk"], persistent_cache={h: "cached desc"})
        assert call_count["n"] == 0  # cache hit, no LLM call
        assert result == ["cached desc"]
        assert new == {}
 # ── cluster_tasks integration ─────────────────────────────────────────────────
 class TestClusterTasks:
    def _no_enrich(self, titles, persistent_cache=None):
        return titles, {}
    def test_empty_tasks(self):
        clusters, new = cluster_tasks([])
        assert clusters == [] and new == {}
    def test_fallback_when_embed_unavailable(self):
        with patch("ml.agents.clustering._enrich_batch", side_effect=self._no_enrich), \
             patch("ml.agents.clustering._embed_batch", return_value=None):
            tasks = [_task("A", "p1"), _task("B", "p2"), _task("C", "p1")]
            clusters, _ = cluster_tasks(tasks)
        assert len(clusters) == 2
        labels = {c.label for c in clusters}
        assert "p1" in labels and "p2" in labels
    def test_fallback_groups_by_project(self):
        with patch("ml.agents.clustering._enrich_batch", side_effect=self._no_enrich), \
             patch("ml.agents.clustering._embed_batch", return_value=None):
            tasks = [_task("A", "work")] * 3 + [_task("B", "home")] * 2
            clusters, _ = cluster_tasks(tasks)
        by_label = {c.label: c.task_count for c in clusters}
        assert by_label["work"] == 3
        assert by_label["home"] == 2
    def test_tasks_without_content_go_to_other(self):
        v = [1.0, 0.0]
        with patch("ml.agents.clustering._enrich_batch", side_effect=self._no_enrich), \
             patch("ml.agents.clustering._embed_batch", return_value=[v]):
            tasks = [_task("Has content"), {"is_overdue": False}]
            clusters, _ = cluster_tasks(tasks)
        labels = {c.label for c in clusters}
        assert "Other tasks" in labels
    def test_semantic_clustering_groups_similar(self):
        v_work = [1.0, 0.0, 0.0]
        v_home = [0.0, 1.0, 0.0]
        batch_result = [v_work, v_work, v_home, v_home]
        with patch("ml.agents.clustering._enrich_batch", side_effect=self._no_enrich), \
             patch("ml.agents.clustering._embed_batch", return_value=batch_result):
            tasks = [
                _task("Write report"),
                _task("Review PR"),
                _task("Buy groceries"),
                _task("Cook dinner"),
            ]
            clusters, _ = cluster_tasks(tasks)
        assert len(clusters) == 2
        assert all(c.task_count == 2 for c in clusters)
    def test_all_tasks_no_content_fallback_by_project(self):
        tasks = [{"project_id": "p1", "is_overdue": False},
                 {"project_id": "p2", "is_overdue": False}]
        clusters, new = cluster_tasks(tasks)
        assert len(clusters) == 2 and new == {}
    def test_enrich_called_before_embed(self):
        """Verify enrichment output (not raw title) is what gets embedded."""
        v = [1.0, 0.0]
        captured = {}
        def fake_embed(texts):
            captured["texts"] = texts
            return [v] * len(texts)
        with patch("ml.agents.clustering._enrich_batch", return_value=(["Expanded desc."], {})), \
             patch("ml.agents.clustering._embed_batch", side_effect=fake_embed):
            cluster_tasks([_task("Buy milk")])
        assert captured["texts"] == ["clustering: Expanded desc."]
    def test_new_enrichments_returned(self):
        v = [1.0, 0.0]
        with patch("ml.agents.clustering._enrich_batch", return_value=(["desc"], {"abc123": "desc"})), \
             patch("ml.agents.clustering._embed_batch", return_value=[v]):
            _, new = cluster_tasks([_task("Buy milk")])
        assert new == {"abc123": "desc"}
--- a/ml/agents/tests/test_inference.py
+++ b/ml/agents/tests/test_inference.py
@@ -0,0 +1,120 @@
 """Tests for the inference framework and time-of-day #112 proof."""
 from __future__ import annotations
 import sys, os
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", ".."))
 import pytest
 from datetime import datetime, timezone
 from ml.agents.inference.history import FeedbackEvent, UserHistory
 from ml.agents.inference.framework import run_inference
 from ml.agents.time_of_day import TimeOfDayAgent, MANIFEST as TOD_MANIFEST, MANIFEST
 from ml.agents.base import AgentInput
 _NOW = datetime(2026, 5, 1, 14, 0, 0, tzinfo=timezone.utc)  # Friday 14:00 UTC
 def _inp(**kwargs) -> AgentInput:
    defaults = dict(user_id="u1", tasks=[], profile={}, now=_NOW, agent_prefs={})
    defaults.update(kwargs)
    return AgentInput(**defaults)
 def _event(action: str, hour: int) -> FeedbackEvent:
    ts = f"2026-05-01T{hour:02d}:00:00+00:00"
    return FeedbackEvent(action=action, dwell_ms=60_000 if action == "done" else 500, created_at=ts)
 class TestRunInference:
    def test_cold_start_when_below_min_history(self):
        history = UserHistory(user_id="u1", events=[_event("done", 9)] * 5)  # only 5 < 10
        result = run_inference(TOD_MANIFEST, history)
        assert result["preferred_hour"] is None  # cold_start_default
    def test_infers_preferred_hour_as_mode(self):
        # 7 events at 09:00, 3 at 17:00 → preferred_hour should be 9
        events = [_event("done", 9)] * 7 + [_event("done", 17)] * 3
        history = UserHistory(user_id="u1", events=events)
        result = run_inference(TOD_MANIFEST, history)
        assert result["preferred_hour"] == 9
    def test_infers_preferred_hour_from_majority_hour(self):
        events = [_event("done", 20)] * 6 + [_event("done", 8)] * 4
        history = UserHistory(user_id="u1", events=events)
        result = run_inference(TOD_MANIFEST, history)
        assert result["preferred_hour"] == 20
    def test_no_inferred_params_returns_empty(self):
        from ml.agents.manifest import AgentManifest
        bare = AgentManifest(
            id="bare", version="1.0.0", description="", pref_schema={},
            context_schema=[], required_consents=[], output_contract={}, ttl_sec=300,
        )
        history = UserHistory(user_id="u1", events=[_event("done", 9)] * 20)
        result = run_inference(bare, history)
        assert result == {}
    def test_cold_start_fallback_on_infer_error(self):
        """infer() raising should fall back to cold_start_default, not crash."""
        from ml.agents.manifest import InferredParam, AgentManifest
        def _bad_infer(h):
            raise RuntimeError("oops")
        m = AgentManifest(
            id="boom", version="1.0.0", description="", pref_schema={},
            context_schema=[], required_consents=[], output_contract={}, ttl_sec=300,
            inferred_params=[InferredParam(key="x", ttl_sec=60, cold_start_default=42, min_history=1, infer=_bad_infer)],
        )
        history = UserHistory(user_id="u1", events=[_event("done", 9)] * 5)
        result = run_inference(m, history)
        assert result["x"] == 42
 class TestTimeOfDayAgentWithInference:
    agent = TimeOfDayAgent()
    def test_uses_preferred_hour_from_agent_prefs(self):
        inp = _inp(agent_prefs={"preferred_hour": 9}, now=datetime(2026, 5, 1, 9, 0, 0, tzinfo=timezone.utc))
        out = self.agent.compute(inp)
        assert "peak productivity hour" in out.prompt_text.lower() or "peak" in out.prompt_text
    def test_quiet_window_noon_suppressed(self):
        inp = _inp(
            agent_prefs={"quiet_start": "22:00", "quiet_end": "07:00"},
            now=datetime(2026, 5, 1, 23, 0, 0, tzinfo=timezone.utc),
        )
        out = self.agent.compute(inp)
        assert "quiet window" in out.prompt_text
    def test_quiet_window_not_in_window(self):
        inp = _inp(
            agent_prefs={"quiet_start": "22:00", "quiet_end": "07:00"},
            now=datetime(2026, 5, 1, 14, 0, 0, tzinfo=timezone.utc),
        )
        out = self.agent.compute(inp)
        assert "quiet window" not in out.prompt_text
    def test_agent_prefs_override_profile(self):
        # agent_prefs.preferred_hour wins over profile.preferred_hour
        inp = _inp(
            profile={"preferred_hour": 8},
            agent_prefs={"preferred_hour": 14},
            now=datetime(2026, 5, 1, 14, 0, 0, tzinfo=timezone.utc),
        )
        out = self.agent.compute(inp)
        assert "peak productivity hour (14:00)" in out.prompt_text
    def test_no_prefs_falls_back_to_profile(self):
        inp = _inp(profile={"preferred_hour": 10}, now=datetime(2026, 5, 1, 10, 0, 0, tzinfo=timezone.utc))
        out = self.agent.compute(inp)
        assert "peak" in out.prompt_text
    def test_version_bumped(self):
        assert MANIFEST.version == "1.2.0"
    def test_manifest_has_preferred_hour_param(self):
        keys = {p.key for p in MANIFEST.inferred_params}
        assert "preferred_hour" in keys
--- a/ml/agents/tests/test_manifest.py
+++ b/ml/agents/tests/test_manifest.py
@@ -0,0 +1,68 @@
 """Manifest registry tests (ADR-0014).
 Each agent module exports a `MANIFEST: AgentManifest` whose id and version
 must agree with the agent class. The registry exposes both, and `to_dict()`
 must drop the `infer` callable so the wire payload is JSON-serialisable.
 """
 from __future__ import annotations
 import json
 import os
 import sys
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", ".."))
 import pytest  # noqa: E402
 from ml.agents.manifest import AgentManifest, InferredParam  # noqa: E402
 from ml.agents.registry import (  # noqa: E402
    all_agents,
    all_manifests,
    get_agent,
    get_manifest,
 )
 def test_every_agent_has_a_matching_manifest():
    agents = {a.agent_id: a for a in all_agents()}
    manifests = {m.id: m for m in all_manifests()}
    assert agents.keys() == manifests.keys(), "agent / manifest registries diverged"
    for aid in agents:
        assert agents[aid].version == manifests[aid].version, (
            f"version mismatch for {aid}: agent={agents[aid].version!r} "
            f"manifest={manifests[aid].version!r}"
        )
 @pytest.mark.parametrize("agent_id", [
    "overdue-task", "momentum", "time-of-day", "recent-patterns", "focus-area",
 ])
 def test_manifest_required_fields(agent_id: str):
    m = get_manifest(agent_id)
    assert m.id == agent_id
    assert m.version
    assert m.description
    assert isinstance(m.pref_schema, dict) and m.pref_schema.get("type") == "object"
    assert isinstance(m.required_consents, list) and m.required_consents
    assert "data:core" in m.required_consents, "every agent should require data:core"
    assert all(c.startswith("data:") for c in m.required_consents), "only data: consents allowed; agent: consents have been removed"
    assert m.ttl_sec == get_agent(agent_id).ttl_seconds, "ttl divergence"
 def test_to_dict_is_json_serialisable_and_drops_infer_callable():
    m = AgentManifest(
        id="x", version="1.0.0", description="d",
        pref_schema={"type": "object"}, context_schema=[], required_consents=["data:core"],
        output_contract={"type": "snippet"}, ttl_sec=60,
        inferred_params=[InferredParam(key="k", ttl_sec=60, cold_start_default=0, min_history=10, infer=lambda h: 0)],
    )
    payload = m.to_dict()
    # Round-trip through json to confirm no callables / non-JSON types leaked.
    data = json.loads(json.dumps(payload))
    assert data["inferred_params"][0]["key"] == "k"
    assert "infer" not in data["inferred_params"][0]
 def test_get_manifest_unknown_raises():
    with pytest.raises(KeyError):
        get_manifest("not-an-agent")
--- a/ml/agents/tests/test_per_agent_inference.py
+++ b/ml/agents/tests/test_per_agent_inference.py
@@ -0,0 +1,663 @@
 """Per-agent inference tests: momentum (#114), overdue-task (#115), recent-patterns (#116),
 time-of-day (#112), and focus-area (#113) preferred_areas wiring."""
 from __future__ import annotations
 import sys, os
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", ".."))
 from datetime import datetime, timezone
 import pytest
 from ml.agents.inference.history import FeedbackEvent, TaskCompletion, UserHistory
 from ml.agents.inference.framework import run_inference
 from ml.agents.momentum import MomentumAgent, MANIFEST as MOMENTUM_MANIFEST
 from ml.agents.overdue_task import OverdueTaskAgent, MANIFEST as OVERDUE_MANIFEST
 from ml.agents.recent_patterns import RecentPatternsAgent, MANIFEST as RECENT_MANIFEST
 from ml.agents.time_of_day import TimeOfDayAgent, MANIFEST as TOD_MANIFEST
 from ml.agents.focus_area import FocusAreaAgent
 from ml.agents.base import AgentInput
 _NOW = datetime(2026, 5, 8, 14, 0, 0, tzinfo=timezone.utc)
 def _inp(**kwargs) -> AgentInput:
    defaults = dict(user_id="u1", tasks=[], profile={}, now=_NOW, agent_prefs={})
    defaults.update(kwargs)
    return AgentInput(**defaults)
 def _event(action: str, days_ago: float = 1.0) -> FeedbackEvent:
    from datetime import timedelta
    ts = (_NOW - timedelta(days=days_ago)).isoformat()
    dwell = 60_000 if action == "done" else 500
    return FeedbackEvent(action=action, dwell_ms=dwell, created_at=ts)
 def _history(*events: FeedbackEvent, completions: list[TaskCompletion] | None = None) -> UserHistory:
    return UserHistory(user_id="u1", events=list(events), task_completions=completions or [])
 def _completion(project_id: str | None, lateness_days: float) -> TaskCompletion:
    """Build a TaskCompletion where completed_at is lateness_days after due_at."""
    from datetime import timedelta
    due = _NOW - timedelta(days=30)
    completed = due + timedelta(days=lateness_days)
    return TaskCompletion(
        project_id=project_id,
        completed_at=completed.isoformat(),
        due_at=due.isoformat(),
    )
 # ── momentum helpers ─────────────────────────────────────────────────────────
 def _neutral_prefs(**extra) -> dict:
    """Prefs that put z-score in the normal range so trend label can show."""
    return {"baseline_completions_per_day": 0.0, "stdev": 1.0, "momentum_window": 7, **extra}
 def _feedback_done(n: int, days_ago: float = 1.0) -> list[dict]:
    from datetime import timedelta
    ts = (_NOW - timedelta(days=days_ago)).isoformat()
    return [{"action": "done", "dwell_ms": 60_000, "created_at": ts}] * n
 # ── momentum: engagement_trend inference ─────────────────────────────────────
 class TestMomentumTrendInference:
    def test_cold_start_below_min_history(self):
        history = _history(*[_event("done", days_ago=i) for i in range(5)])
        result = run_inference(MOMENTUM_MANIFEST, history)
        assert result["engagement_trend"] == "stable"  # cold_start_default
    def test_trend_up_when_recent_done_rate_higher(self):
        recent = [_event("done", days_ago=i) for i in range(1, 9)]
        older = [_event("dismiss", days_ago=i) for i in range(8, 15)]
        older[0] = _event("done", days_ago=8)
        history = _history(*recent, *older)
        result = run_inference(MOMENTUM_MANIFEST, history)
        assert result["engagement_trend"] == "up"
    def test_trend_down_when_recent_done_rate_lower(self):
        recent = [_event("dismiss", days_ago=i) for i in range(1, 8)]
        older = [_event("done", days_ago=i) for i in range(8, 15)]
        history = _history(*recent, *older)
        result = run_inference(MOMENTUM_MANIFEST, history)
        assert result["engagement_trend"] == "down"
    def test_trend_stable_when_similar(self):
        events = [_event("done" if i % 2 == 0 else "dismiss", days_ago=i) for i in range(1, 15)]
        history = _history(*events)
        result = run_inference(MOMENTUM_MANIFEST, history)
        assert result["engagement_trend"] == "stable"
    def test_trend_shown_when_z_score_normal(self):
        # baseline=0 so z≈0 → no z label → trend label falls through
        out = MomentumAgent().compute(_inp(agent_prefs=_neutral_prefs(engagement_trend="up")))
        assert "trending up" in out.prompt_text
    def test_trend_down_shown_when_z_score_normal(self):
        out = MomentumAgent().compute(_inp(agent_prefs=_neutral_prefs(engagement_trend="down")))
        assert "trending down" in out.prompt_text
    def test_snapshot_includes_trend(self):
        out = MomentumAgent().compute(_inp(agent_prefs=_neutral_prefs(engagement_trend="stable")))
        assert "engagement_trend" in out.signals_snapshot
 # ── momentum: baseline + stdev inference (#114) ───────────────────────────────
 class TestMomentumBaselineInference:
    def _events_n_per_day(self, done_per_day: int, n_days: int) -> list[FeedbackEvent]:
        """Generate done events spread across n_days."""
        events = []
        for d in range(n_days):
            for _ in range(done_per_day):
                events.append(_event("done", days_ago=d + 0.5))
        return events
    def test_cold_start_when_few_events(self):
        history = _history(*[_event("done", days_ago=i) for i in range(5)])
        result = run_inference(MOMENTUM_MANIFEST, history)
        assert result["baseline_completions_per_day"] == 1.0
        assert result["stdev"] == 1.0
    def test_power_user_baseline_high(self):
        # 5 done events per day for 20 days = 100 events → baseline ≈ 3.6/day (100/28 over the 28d window, zeros fill rest)
        events = self._events_n_per_day(5, 20)
        history = _history(*events)
        result = run_inference(MOMENTUM_MANIFEST, history)
        assert result["baseline_completions_per_day"] > 2.0
    def test_casual_user_baseline_low(self):
        # 1 done every 3 days (7 total) + dismiss filler to clear min_history=14 → baseline ≤ 0.33/day (0.25/day if averaged over the 28d window)
        done_events = [_event("done", days_ago=d * 3 + 0.5) for d in range(7)]
        filler = [_event("dismiss", days_ago=d + 0.5) for d in range(10)]
        history = _history(*done_events, *filler)
        result = run_inference(MOMENTUM_MANIFEST, history)
        assert result["baseline_completions_per_day"] < 0.5
    def test_stdev_reflects_variability(self):
        # Alternating 0 and 4 done events → high stdev
        events = []
        for d in range(14):
            if d % 2 == 0:
                for _ in range(4):
                    events.append(_event("done", days_ago=d + 0.5))
        history = _history(*events)
        result = run_inference(MOMENTUM_MANIFEST, history)
        assert result["stdev"] > 1.0
    def test_consistent_user_lower_stdev_than_variable(self):
        # Consistent 2/day for 28 days has lower stdev than alternating 0/4
        consistent = self._events_n_per_day(2, 28)
        variable = []
        for d in range(14):
            if d % 2 == 0:
                for _ in range(4):
                    variable.append(_event("done", days_ago=d + 0.5))
            else:
                variable.append(_event("dismiss", days_ago=d + 0.5))
        r_consistent = run_inference(MOMENTUM_MANIFEST, _history(*consistent))
        r_variable = run_inference(MOMENTUM_MANIFEST, _history(*variable))
        assert r_consistent["stdev"] < r_variable["stdev"]
 # ── momentum: z-score snippet language ───────────────────────────────────────
 class TestMomentumZScore:
    def _prefs(self, baseline: float, stdev: float = 1.0) -> dict:
        return {"baseline_completions_per_day": baseline, "stdev": stdev,
                "momentum_window": 7, "engagement_trend": "stable"}
    def test_power_user_above_baseline_says_above_usual(self):
        # baseline=3/day, stdev=1.0, window=7 → expected rate=3; user did 35 → rate=5, z=2
        prefs = self._prefs(baseline=3.0, stdev=1.0)
        feedback = _feedback_done(35, days_ago=1.0)
        out = MomentumAgent().compute(_inp(feedback_history=feedback, agent_prefs=prefs))
        assert "above your usual" in out.prompt_text
    def test_casual_user_slowing_down(self):
        # baseline=1/day, user did 0 in 7d → z = (0 - 1) / 1 = -1 → below usual
        prefs = self._prefs(baseline=1.0, stdev=1.0)
        out = MomentumAgent().compute(_inp(feedback_history=[], agent_prefs=prefs))
        assert "below your usual" in out.prompt_text
    def test_returning_from_break_at_normal_rate(self):
        # User just came back: 1 done, baseline=1/day, window=7 → z=(1/7-1)/1≈-0.86, within normal
        prefs = self._prefs(baseline=1.0, stdev=1.0)
        feedback = _feedback_done(1, days_ago=0.5)
        out = MomentumAgent().compute(_inp(feedback_history=feedback, agent_prefs=prefs))
        # z ≈ -0.86 → no z label, falls back to trend (stable → no extra sentence)
        assert "above your usual" not in out.prompt_text
        assert "below your usual" not in out.prompt_text
    def test_snapshot_includes_z_score(self):
        prefs = self._prefs(baseline=1.0)
        out = MomentumAgent().compute(_inp(agent_prefs=prefs))
        assert "z_score" in out.signals_snapshot
        assert "recent_done_count" in out.signals_snapshot
    def test_version_bumped(self):
        assert MOMENTUM_MANIFEST.version == "1.2.0"
 # ── overdue-task: lateness_tolerance_days + project_realness (#115) ──────────
 class TestOverdueTaskInference:
    # -- lateness_tolerance_days inference --
    def test_cold_start_returns_zero_when_few_completions(self):
        # Below min_history=10 task completions → cold start
        cs = [_completion("p1", 2.0) for _ in range(5)]
        history = _history(*[_event("done")] * 5, completions=cs)
        result = run_inference(OVERDUE_MANIFEST, history)
        assert result["lateness_tolerance_days"] == 0.0
    def test_punctual_user_zero_tolerance(self):
        # User always finishes early or on time (negative lateness) → tolerance 0
        cs = [_completion("p1", -1.0) for _ in range(12)]
        history = _history(*[_event("done")] * 12, completions=cs)
        result = run_inference(OVERDUE_MANIFEST, history)
        assert result["lateness_tolerance_days"] == 0.0
    def test_chronic_late_user_positive_tolerance(self):
        # User consistently finishes 5 days late → p50 = 5
        cs = [_completion("p1", 5.0) for _ in range(12)]
        history = _history(*[_event("done")] * 12, completions=cs)
        result = run_inference(OVERDUE_MANIFEST, history)
        assert result["lateness_tolerance_days"] == pytest.approx(5.0)
    def test_mixed_lateness_uses_median(self):
        # 6 tasks at +1d, 6 tasks at +3d → median = 2
        cs = [_completion("p1", 1.0)] * 6 + [_completion("p1", 3.0)] * 6
        history = _history(*[_event("done")] * 12, completions=cs)
        result = run_inference(OVERDUE_MANIFEST, history)
        assert result["lateness_tolerance_days"] == pytest.approx(2.0)
    # -- project_realness inference --
    def test_project_realness_cold_start_empty(self):
        cs = [_completion("p1", 1.0) for _ in range(5)]  # below min_history
        history = _history(*[_event("done")] * 5, completions=cs)
        result = run_inference(OVERDUE_MANIFEST, history)
        assert result["project_realness"] == {}
    def test_project_realness_punctual_project_scores_high(self):
        # p1 always on time (0d late), p2 always 10d late → p1 should be realness ≈ 1
        cs = [_completion("p1", 0.0)] * 6 + [_completion("p2", 10.0)] * 6
        history = _history(*[_event("done")] * 12, completions=cs)
        result = run_inference(OVERDUE_MANIFEST, history)
        assert result["project_realness"]["p1"] > result["project_realness"]["p2"]
    def test_project_realness_values_clipped_01(self):
        cs = [_completion("p1", 0.0)] * 6 + [_completion("p2", 100.0)] * 6
        history = _history(*[_event("done")] * 12, completions=cs)
        result = run_inference(OVERDUE_MANIFEST, history)
        for v in result["project_realness"].values():
            assert 0.0 <= v <= 1.0
    # -- compute() reads inferred prefs --
    def test_tolerance_filters_tasks(self):
        tasks = [
            {"content": "Fresh overdue", "is_overdue": True, "task_age_days": 0.5},
            {"content": "Old overdue", "is_overdue": True, "task_age_days": 3.0},
        ]
        out = OverdueTaskAgent().compute(_inp(tasks=tasks, agent_prefs={"lateness_tolerance_days": 2}))
        assert "1 overdue task" in out.prompt_text
        assert "Old overdue" in out.prompt_text
    def test_low_realness_softens_language(self):
        tasks = [{"content": "Wishlist", "is_overdue": True, "task_age_days": 3.0,
                  "project_id": "aspirational"}]
        prefs = {"lateness_tolerance_days": 0, "project_realness": {"aspirational": 0.2}}
        out = OverdueTaskAgent().compute(_inp(tasks=tasks, agent_prefs=prefs))
        assert "target date" in out.prompt_text
    def test_high_realness_uses_overdue_language(self):
        tasks = [{"content": "Critical", "is_overdue": True, "task_age_days": 3.0,
                  "project_id": "work"}]
        prefs = {"lateness_tolerance_days": 0, "project_realness": {"work": 0.9}}
        out = OverdueTaskAgent().compute(_inp(tasks=tasks, agent_prefs=prefs))
        assert "overdue" in out.prompt_text
    def test_snapshot_includes_realness(self):
        tasks = [{"content": "T", "is_overdue": True, "task_age_days": 1.0, "project_id": "p1"}]
        prefs = {"lateness_tolerance_days": 0, "project_realness": {"p1": 0.8}}
        out = OverdueTaskAgent().compute(_inp(tasks=tasks, agent_prefs=prefs))
        assert "realness" in out.signals_snapshot["top_overdue"][0]
    def test_version_bumped(self):
        assert OVERDUE_MANIFEST.version == "1.2.0"
 # ── recent-patterns: lookback_days + weekly_cycle + daily_cycle (#116) ────────
 def _done_at(days_ago: float, hour: int = 10) -> FeedbackEvent:
    """Done event at a specific hour, N days ago."""
    from datetime import timedelta
    ts = (_NOW - timedelta(days=days_ago)).replace(hour=hour, minute=0, second=0, microsecond=0)
    return FeedbackEvent(action="done", dwell_ms=60_000, created_at=ts.isoformat())
 class TestRecentPatternsLookbackInference:
    def test_cold_start_below_min_history(self):
        history = _history(*[_event("done") for _ in range(3)])
        result = run_inference(RECENT_MANIFEST, history)
        assert result["lookback_days"] == 7  # cold_start_default
    def test_sparse_done_history_returns_30(self):
        # Only 10 done events → fewer than 30 → returns cap of 30
        history = _history(*[_event("done") for _ in range(10)])
        result = run_inference(RECENT_MANIFEST, history)
        assert result["lookback_days"] == 30
    def test_dense_done_history_returns_short_window(self):
        # 30 done events all within the last 2 days → lookback_days = 1 or 2
        events = [_event("done", days_ago=i * 0.05) for i in range(30)]
        history = _history(*events)
        result = run_inference(RECENT_MANIFEST, history)
        assert result["lookback_days"] <= 2
    def test_spread_history_spans_window_correctly(self):
        # 30 done events spread over 15 days (1 per 0.5d) → window should be ≈15
        events = [_event("done", days_ago=i * 0.5) for i in range(30)]
        history = _history(*events)
        result = run_inference(RECENT_MANIFEST, history)
        assert result["lookback_days"] <= 16
    def test_agent_respects_lookback_days_pref(self):
        from datetime import timedelta
        feedback = [
            {"action": "done", "dwell_ms": 60000,
             "created_at": (_NOW - timedelta(days=10)).isoformat()}
        ] * 5
        out_narrow = RecentPatternsAgent().compute(
            _inp(feedback_history=feedback, agent_prefs={"lookback_days": 7})
        )
        out_wide = RecentPatternsAgent().compute(
            _inp(feedback_history=feedback, agent_prefs={"lookback_days": 14})
        )
        assert "No tip reactions" in out_narrow.prompt_text
        assert "5 tip reactions" in out_wide.prompt_text
    def test_legacy_window_days_pref_still_works(self):
        from datetime import timedelta
        feedback = [
            {"action": "done", "dwell_ms": 60000,
             "created_at": (_NOW - timedelta(days=10)).isoformat()}
        ] * 5
        out = RecentPatternsAgent().compute(
            _inp(feedback_history=feedback, agent_prefs={"window_days": 14})
        )
        assert "5 tip reactions" in out.prompt_text
    def test_snapshot_includes_lookback_days(self):
        out = RecentPatternsAgent().compute(_inp(agent_prefs={"lookback_days": 14}))
        assert out.signals_snapshot["lookback_days"] == 14
 class TestRecentPatternsWeeklyCycle:
    def test_cold_start_returns_empty(self):
        history = _history(*[_event("done") for _ in range(5)])  # below min_history=21
        result = run_inference(RECENT_MANIFEST, history)
        assert result["weekly_cycle"] == []
    def _events_on_dow(self, target_dow: int, count: int, n_weeks: int = 4) -> list[FeedbackEvent]:
        """Generate `count` done events per week on `target_dow` (0=Mon…6=Sun).
        _NOW is Friday (weekday=4). days_back = (now_dow - target_dow) % 7
        gives the offset to the most recent occurrence of target_dow.
        """
        now_dow = _NOW.weekday()  # 4 = Friday
        days_back = (now_dow - target_dow) % 7
        if days_back == 0:
            days_back = 7  # avoid "today" — use the previous occurrence
        events = []
        for week in range(n_weeks):
            offset = days_back + week * 7
            for _ in range(count):
                events.append(_done_at(offset + 0.1, hour=11))
        return events
    def _weekend_warrior_history(self) -> UserHistory:
        """Many done events on Sat/Sun (dow 5 & 6), few on Tuesday (dow 1)."""
        events = []
        events += self._events_on_dow(5, count=5)   # Saturday
        events += self._events_on_dow(6, count=5)   # Sunday
        events += self._events_on_dow(1, count=1)   # Tuesday — one per week
        return _history(*events)
    def test_weekend_warrior_strong_on_weekends(self):
        history = self._weekend_warrior_history()
        result = run_inference(RECENT_MANIFEST, history)
        by_dow = {e["dow"]: e["strength"] for e in result["weekly_cycle"]}
        assert by_dow.get(5, 0) > 1.0   # Saturday
        assert by_dow.get(6, 0) > 1.0   # Sunday
    def test_weekday_only_low_weekend_strength(self):
        events = []
        for dow in range(5):  # Monday–Friday
            events += self._events_on_dow(dow, count=3)
        # Saturday (5) and Sunday (6) get zero events
        history = _history(*events)
        result = run_inference(RECENT_MANIFEST, history)
        by_dow = {e["dow"]: e["strength"] for e in result["weekly_cycle"]}
        assert by_dow.get(5, 0) == 0.0   # Saturday
        assert by_dow.get(6, 0) == 0.0   # Sunday
    def test_snippet_includes_cycle_hint_when_strong(self):
        # Inject a strong weekly_cycle pref directly
        prefs = {
            "lookback_days": 7,
            "weekly_cycle": [{"dow": 1, "strength": 2.0, "sample": "completes most Tuesdays"}],
            "daily_cycle": [],
        }
        out = RecentPatternsAgent().compute(_inp(agent_prefs=prefs))
        assert "Tuesday" in out.prompt_text
    def test_snippet_omits_cycle_hint_when_weak(self):
        prefs = {
            "lookback_days": 7,
            "weekly_cycle": [{"dow": 1, "strength": 0.3, "sample": "completes most Tuesdays"}],
            "daily_cycle": [],
        }
        out = RecentPatternsAgent().compute(_inp(agent_prefs=prefs))
        assert "Tuesday" not in out.prompt_text
 class TestRecentPatternsDailyCycle:
    def test_cold_start_returns_empty(self):
        history = _history(*[_event("done") for _ in range(5)])  # below min_history=14
        result = run_inference(RECENT_MANIFEST, history)
        assert result["daily_cycle"] == []
    def _evening_person_history(self) -> UserHistory:
        """Many done events at 20:00–21:00, few in the morning."""
        events = []
        for d in range(20):
            for _ in range(4):
                events.append(_done_at(d + 0.5, hour=20))
            events.append(_done_at(d + 0.5, hour=9))
        return _history(*events)
    def test_evening_person_strong_at_evening_hours(self):
        history = self._evening_person_history()
        result = run_inference(RECENT_MANIFEST, history)
        by_hour = {e["hour"]: e["strength"] for e in result["daily_cycle"]}
        assert by_hour.get(20, 0) > 1.0
        assert by_hour.get(9, 0) < by_hour.get(20, 0)
    def test_snippet_includes_daily_hint_when_strong(self):
        prefs = {
            "lookback_days": 7,
            "weekly_cycle": [],
            "daily_cycle": [{"hour": 20, "strength": 3.0}],
        }
        out = RecentPatternsAgent().compute(_inp(agent_prefs=prefs))
        assert "8pm" in out.prompt_text
    def test_snippet_omits_daily_hint_when_weak(self):
        prefs = {
            "lookback_days": 7,
            "weekly_cycle": [],
            "daily_cycle": [{"hour": 20, "strength": 0.4}],
        }
        out = RecentPatternsAgent().compute(_inp(agent_prefs=prefs))
        assert "8pm" not in out.prompt_text
    def test_no_pattern_user_no_hints(self):
        # Uniform distribution across all hours: strength = count / mean and all
        # counts are equal, so every hour sits at 1.0 and there is no real peak.
        events = [_done_at(d + 0.5, hour=h) for d in range(3) for h in range(24)]
        history = _history(*events)
        result = run_inference(RECENT_MANIFEST, history)
        # Allow a small tolerance above the flat baseline of 1.0.
        assert all(e["strength"] <= 1.1 for e in result["daily_cycle"])
    def test_version_bumped(self):
        assert RECENT_MANIFEST.version == "1.2.0"
 # ── time-of-day: quiet_start/end + peak_hours inference (#112) ───────────────
 def _tod_event(action: str, hour: int, days_ago: float = 1.0) -> FeedbackEvent:
    """Feedback event at a specific hour N days ago."""
    from datetime import timedelta
    dt = (_NOW - timedelta(days=days_ago)).replace(hour=hour, minute=0, second=0, microsecond=0)
    return FeedbackEvent(action=action, dwell_ms=60_000, created_at=dt.isoformat())
 def _tod_history(*events: FeedbackEvent) -> UserHistory:
    return UserHistory(user_id="u1", events=list(events))
 class TestTimeOfDayQuietWindow:
    def test_cold_start_below_min_history(self):
        history = _tod_history(*[_tod_event("done", 10) for _ in range(10)])
        result = run_inference(TOD_MANIFEST, history)
        assert result["quiet_start"] == "22:00"
        assert result["quiet_end"] == "07:00"
    def _night_owl_history(self) -> UserHistory:
        """Active 20:00–01:00 plus one event at 15:00 per day; quiet 02:00–14:00."""
        events = []
        for d in range(10):
            for h in [20, 21, 22, 23, 0, 1]:
                events.append(_tod_event("done", h, days_ago=d + 0.5))
            # Sparse during day
            events.append(_tod_event("done", 15, days_ago=d + 0.5))
        return _tod_history(*events)
    def _early_bird_history(self) -> UserHistory:
        """Active 06:00–10:00, quiet 21:00–05:00."""
        events = []
        for d in range(10):
            for h in [6, 7, 8, 9, 10]:
                events.append(_tod_event("done", h, days_ago=d + 0.5))
            events.append(_tod_event("done", 14, days_ago=d + 0.5))
        return _tod_history(*events)
    def test_early_bird_quiet_in_evening(self):
        history = self._early_bird_history()
        result = run_inference(TOD_MANIFEST, history)
        # Quiet window should be in the evening/night range
        start_h = int(result["quiet_start"].split(":")[0])
        end_h = int(result["quiet_end"].split(":")[0])
        # Quiet window spans from some evening hour into morning
        assert start_h >= 18 or end_h <= 10  # covers night
    def test_quiet_window_wraps_midnight(self):
        # Night owl: heavy activity in evening, quiet 02:00–14:00
        history = self._night_owl_history()
        result = run_inference(TOD_MANIFEST, history)
        start_h = int(result["quiet_start"].split(":")[0])
        end_h = int(result["quiet_end"].split(":")[0])
        # The quiet window should span across midnight or be in daylight
        # (start > end means wraps midnight)
        is_wrapping = start_h > end_h
        is_daytime = 2 <= start_h <= 14
        assert is_wrapping or is_daytime
    def test_format_is_hhmm(self):
        history = self._early_bird_history()
        result = run_inference(TOD_MANIFEST, history)
        import re
        assert re.match(r"^\d{2}:00$", result["quiet_start"])
        assert re.match(r"^\d{2}:00$", result["quiet_end"])
 class TestTimeOfDayPeakHours:
    def _evening_person_history(self, n: int = 60) -> UserHistory:
        """Heavy done events at 19:00 and 20:00, light elsewhere."""
        events = []
        for i in range(n):
            events.append(_tod_event("done", 19, days_ago=i * 0.5))
            events.append(_tod_event("done", 20, days_ago=i * 0.5))
            if i % 6 == 0:  # low volume: one morning event per six evening pairs
                events.append(_tod_event("done", 10, days_ago=i * 0.5))
        return _tod_history(*events)
    def test_cold_start_returns_default(self):
        history = _tod_history(*[_tod_event("done", 10) for _ in range(5)])
        result = run_inference(TOD_MANIFEST, history)
        assert result["peak_hours"] == [9, 14, 20]
    def test_evening_person_peak_hours_in_evening(self):
        history = self._evening_person_history()
        result = run_inference(TOD_MANIFEST, history)
        assert 19 in result["peak_hours"] or 20 in result["peak_hours"]
    def test_peak_hours_sorted(self):
        history = self._evening_person_history()
        result = run_inference(TOD_MANIFEST, history)
        assert result["peak_hours"] == sorted(result["peak_hours"])
    def test_shift_worker_peaks_at_unusual_hours(self):
        """Shift worker active at 02:00 and 03:00."""
        events = [_tod_event("done", h, days_ago=i * 0.5)
                  for i in range(30) for h in [2, 3]]
        events += [_tod_event("done", 14, days_ago=i * 0.5) for i in range(5)]
        history = _tod_history(*events)
        result = run_inference(TOD_MANIFEST, history)
        assert 2 in result["peak_hours"] or 3 in result["peak_hours"]
 class TestTimeOfDaySnippet:
    agent = TimeOfDayAgent()
    def _inp_at(self, hour: int, **prefs) -> AgentInput:
        now = _NOW.replace(hour=hour)
        return _inp(now=now, agent_prefs=prefs)
    def test_in_peak_hour_says_peak(self):
        out = self.agent.compute(self._inp_at(20, peak_hours=[20]))
        assert "peak productivity hour" in out.prompt_text
    def test_approaching_peak_says_approaching(self):
        out = self.agent.compute(self._inp_at(18, peak_hours=[20]))
        assert "approaching" in out.prompt_text.lower()
    def test_quiet_window_overrides_peak(self):
        # Even if hour is in peak_hours, quiet window wins
        out = self.agent.compute(
            self._inp_at(23, quiet_start="22:00", quiet_end="07:00", peak_hours=[23])
        )
        assert "quiet window" in out.prompt_text
    def test_tz_shown_when_not_utc(self):
        out = self.agent.compute(self._inp_at(10, tz="Europe/Moscow"))
        assert "Europe/Moscow" in out.prompt_text
    def test_snapshot_includes_peak_and_quiet(self):
        out = self.agent.compute(self._inp_at(10, peak_hours=[10], quiet_start="22:00", quiet_end="07:00"))
        assert "peak_hours" in out.signals_snapshot
        assert "in_quiet" in out.signals_snapshot
        assert "in_peak" in out.signals_snapshot
    def test_version_bumped(self):
        assert TOD_MANIFEST.version == "1.2.0"
    def test_manifest_has_new_params(self):
        keys = {p.key for p in TOD_MANIFEST.inferred_params}
        assert {"quiet_start", "quiet_end", "peak_hours", "tz"}.issubset(keys)
 # ── focus-area: cluster summary output ───────────────────────────────────────
 class TestFocusAreaOutput:
    agent = FocusAreaAgent()
    def _task(self, content: str, project_id: str) -> dict:
        return {"id": "t1", "content": content, "is_overdue": False,
                "task_age_days": 2.0, "priority": 1, "project_id": project_id}
    def test_version(self):
        from ml.agents.focus_area import MANIFEST as FA_MANIFEST
        assert FA_MANIFEST.version == "3.0.0"
    def test_all_clusters_in_output(self):
        tasks = [self._task("Work thing", "work"), self._task("Home thing", "home")]
        out = self.agent.compute(_inp(tasks=tasks))
        assert "work" in out.prompt_text.lower()
        assert "home" in out.prompt_text.lower()
    def test_task_titles_in_output(self):
        tasks = [self._task("Buy milk", "personal")]
        out = self.agent.compute(_inp(tasks=tasks))
        assert '"Buy milk"' in out.prompt_text
    def test_snapshot_shape(self):
        tasks = [self._task("T", "work")]
        out = self.agent.compute(_inp(tasks=tasks))
        public_keys = {k for k in out.signals_snapshot if not k.startswith("_")}
        assert public_keys == {"cluster_count", "clusters"}
        assert isinstance(out.signals_snapshot["clusters"], list)
    def test_no_inferred_params(self):
        from ml.agents.focus_area import MANIFEST as FA_MANIFEST
        assert FA_MANIFEST.inferred_params == []
--- a/ml/agents/time_of_day.py
+++ b/ml/agents/time_of_day.py
@@ -0,0 +1,266 @@
 from __future__ import annotations
 import statistics
 from collections import Counter
 from typing import ClassVar
 from .base import BaseAgent, AgentInput, AgentOutput
 from .inference.history import UserHistory
 from .manifest import AgentManifest, InferredParam
 _DOW_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
 # min_history required before quiet/peak inference is meaningful (issue #112)
 _MIN_HISTORY = 50
 def _infer_preferred_hour(history: UserHistory) -> int:
    """Mode hour of day across all 'done' feedback events; falls back to 9."""
    done_hours = [e.hour for e in history.events if e.action == "done"]
    if not done_hours:
        return 9
    return Counter(done_hours).most_common(1)[0][0]
 def _quiet_window_hours(history: UserHistory) -> tuple[int, int]:
    """Return (start_hour, end_hour) of the longest below-baseline quiet window.
    Counts all engagement events by hour. Baseline = mean hourly count.
    Finds the longest contiguous run of below-baseline hours on the circular
    clock; that run defines the quiet window.
    """
    by_hour: Counter[int] = Counter(e.hour for e in history.events)
    total = sum(by_hour.values())
    baseline = total / 24
    # Mark each of the 24 hours as below-baseline (True = quiet)
    quiet: list[bool] = [by_hour.get(h, 0) < baseline for h in range(24)]
    # Find longest contiguous run in circular array
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    # Double the sequence to handle wrap-around
    for i in range(48):
        h = i % 24
        if quiet[h]:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_len = run_len
                best_start = run_start
        else:
            run_len = 0
    if best_len == 0:
        return (22, 7)  # fallback
    start = best_start % 24
    end = (best_start + best_len) % 24
    return (start, end)
 def _infer_quiet_start(history: UserHistory) -> str:
    start, _ = _quiet_window_hours(history)
    return f"{start:02d}:00"
 def _infer_quiet_end(history: UserHistory) -> str:
    _, end = _quiet_window_hours(history)
    return f"{end:02d}:00"
 def _infer_peak_hours(history: UserHistory) -> list[int]:
    """Top-quartile hours by done-event count.
    Computes done_count per hour, then returns hours at or above the 75th
    percentile of non-zero hourly counts, sorted ascending.
    """
    done_by_hour: Counter[int] = Counter(
        e.hour for e in history.events if e.action == "done"
    )
    if not done_by_hour:
        return [9, 14, 20]
    counts = list(done_by_hour.values())
    if len(counts) < 2:
        # statistics.quantiles() raises StatisticsError on a single data point
        # before Python 3.13; a lone active hour is trivially the peak.
        return sorted(done_by_hour)
    threshold = statistics.quantiles(counts, n=4)[-1]  # 75th percentile
    return sorted(h for h, c in done_by_hour.items() if c >= threshold)
 MANIFEST = AgentManifest(
    id="time-of-day",
    version="1.2.0",  # #112: quiet_start/end + peak_hours + tz inference
    description="Frames the current moment relative to the user's productive peak and quiet hours.",
    pref_schema={
        "type": "object",
        "additionalProperties": False,
        "properties": {
            "quiet_start": {
                "type": "string",
                "pattern": "^([01][0-9]|2[0-3]):[0-5][0-9]$",
                "description": "HH:MM start of quiet hours (24h, user's local TZ).",
            },
            "quiet_end": {
                "type": "string",
                "pattern": "^([01][0-9]|2[0-3]):[0-5][0-9]$",
                "description": "HH:MM end of quiet hours.",
            },
            "peak_hours": {
                "type": "array",
                "items": {"type": "integer", "minimum": 0, "maximum": 23},
                "default": [9, 14, 20],
                "description": "Hours (0–23) with top-quartile completion density.",
            },
            "tz": {
                "type": "string",
                "default": "UTC",
                "description": "IANA timezone; populated from auth provider, fallback UTC.",
            },
            "preferred_hour": {
                "type": "integer",
                "minimum": 0,
                "maximum": 23,
                "description": "Mode done-hour (legacy; superseded by peak_hours).",
            },
        },
    },
    context_schema=["profile.features"],
    required_consents=["data:core"],
    output_contract={"type": "snippet", "format": "free_text"},
    ttl_sec=900,
    inferred_params=[
        InferredParam(
            key="preferred_hour",
            ttl_sec=3_600,
            cold_start_default=None,
            min_history=10,
            infer=_infer_preferred_hour,
        ),
        InferredParam(
            key="quiet_start",
            ttl_sec=86_400,
            cold_start_default="22:00",
            min_history=_MIN_HISTORY,
            infer=_infer_quiet_start,
        ),
        InferredParam(
            key="quiet_end",
            ttl_sec=86_400,
            cold_start_default="07:00",
            min_history=_MIN_HISTORY,
            infer=_infer_quiet_end,
        ),
        InferredParam(
            key="peak_hours",
            ttl_sec=86_400,
            cold_start_default=[9, 14, 20],
            min_history=_MIN_HISTORY,
            infer=_infer_peak_hours,
        ),
        # tz is populated from the auth provider; no infer function.
        InferredParam(
            key="tz",
            ttl_sec=86_400,
            cold_start_default="UTC",
            min_history=999_999,   # effectively never inferred — always cold_start
            infer=None,
        ),
    ],
 )
 class TimeOfDayAgent(BaseAgent):
    """Frames the current moment relative to the user's productive peak."""
    agent_id: ClassVar[str] = MANIFEST.id
    ttl_seconds: ClassVar[int] = MANIFEST.ttl_sec
    version: ClassVar[str] = MANIFEST.version
    def compute(self, inp: AgentInput) -> AgentOutput:
        hour = inp.now.hour
        dow = inp.now.weekday()
        is_weekend = dow >= 5
        preferred_raw = inp.agent_prefs.get("preferred_hour", inp.profile.get("preferred_hour"))
        preferred = int(preferred_raw) if preferred_raw is not None else None
        quiet_start: str | None = inp.agent_prefs.get("quiet_start")
        quiet_end: str | None = inp.agent_prefs.get("quiet_end")
        peak_hours: list[int] = inp.agent_prefs.get("peak_hours", [])
        tz: str = inp.agent_prefs.get("tz", "UTC")
        in_quiet = self._in_quiet_window(hour, quiet_start, quiet_end)
        in_peak = hour in peak_hours
        parts = [f"It is {hour:02d}:00 on {_DOW_NAMES[dow]} ({self._label(hour)})."]
        if tz != "UTC":
            parts[0] = f"It is {hour:02d}:00 ({tz}) on {_DOW_NAMES[dow]} ({self._label(hour)})."
        if is_weekend:
            parts.append("Weekend context — prefer personal or reflective tips over work tasks.")
        if in_quiet:
            parts.append(
                f"User is in their quiet window ({quiet_start}–{quiet_end}) — "
                "avoid urgent or demanding tips."
            )
        elif in_peak:
            parts.append(
                f"Hour {hour:02d}:00 is a peak productivity hour for this user — "
                "a high-impact or challenging tip is appropriate."
            )
        elif peak_hours:
            # Report nearest peak so orchestrator can time advice accordingly.
            nearest = min(peak_hours, key=lambda p: min(abs(p - hour), 24 - abs(p - hour)))
            delta = min(abs(nearest - hour), 24 - abs(nearest - hour))
            if delta <= 2:
                parts.append(f"Approaching peak productivity window ({nearest:02d}:00).")
        elif preferred is not None:
            delta = min(abs(hour - preferred), 24 - abs(hour - preferred))
            if delta == 0:
                parts.append(
                    f"This is the user's peak productivity hour ({preferred:02d}:00) — "
                    "a high-impact tip is appropriate."
                )
            elif delta <= 2:
                parts.append(f"Approaching the user's peak productivity window ({preferred:02d}:00).")
        else:
            parts.append("No preferred-hour data yet.")
        prompt = " ".join(parts)
        snapshot = {
            "hour": hour,
            "day_of_week": dow,
            "preferred_hour": preferred,
            "quiet_start": quiet_start,
            "quiet_end": quiet_end,
            "peak_hours": peak_hours,
            "in_quiet": in_quiet,
            "in_peak": in_peak,
            "tz": tz,
        }
        return self._make_output(inp, prompt, snapshot)
    @staticmethod
    def _in_quiet_window(hour: int, start: str | None, end: str | None) -> bool:
        if not start or not end:
            return False
        try:
            sh = int(start.split(":")[0])
            eh = int(end.split(":")[0])
        except (ValueError, IndexError):
            return False
        if sh <= eh:
            return sh <= hour < eh
        # wraps midnight e.g. 22:00–07:00
        return hour >= sh or hour < eh
    @staticmethod
    def _label(hour: int) -> str:
        if 5 <= hour < 12:
            return "morning"
        if 12 <= hour < 17:
            return "afternoon"
        if 17 <= hour < 21:
            return "evening"
        return "night"
--- a/ml/experiments/bench/README.md
+++ b/ml/experiments/bench/README.md
@@ -0,0 +1,85 @@
 # `bench/` — combined model + prompt evaluation harness
 Combines the work of issues **#93** (model benchmark) and **#95** (prompt
 A/B) into one MLflow-tracked experiment. Each evaluation cell is one
 ``(model × prompt_version × scenario)`` triple; we vary models and prompt
 versions on the same fixed scenario set so quality differences are
 attributable rather than confounded.
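As a rough illustration of the grid, the cells can be enumerated as a Cartesian product. This is only a sketch: the scenario ids below are placeholders, and the real lists come from the `collect.py` flags and `scenarios.py`.

```python
from itertools import product

# Placeholder values for illustration only; the scenario ids are hypothetical.
models = ["qwen2.5:0.5b", "gemma3:1b"]
prompts = ["v1", "v2-mentor"]
scenarios = ["scenario-a", "scenario-b", "scenario-c"]

# One evaluation cell per (model, prompt_version, scenario) triple.
cells = list(product(models, prompts, scenarios))

print(len(cells))  # 2 * 2 * 3 cells
print(cells[0])
```

Because every model and prompt version sees the same scenario list, two cells that differ only in model (or only in prompt) can be compared directly.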
 ## Pieces
 | File | Purpose |
 |------|---------|
 | `rubric.md`         | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
 | `scenarios.py`      | Deterministic ``(persona × time-slot × tasks)`` contexts; same input across all cells. |
 | `mlflow_client.py`  | Thin httpx-based MLflow REST wrapper. Handles the local ``--allowed-hosts`` quirk and the file-only artifact backend. |
 | `collect.py`        | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
 | `judge_cli.py`      | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
 | `compare.py`        | **Phase C.** Leaderboard per ``(model, prompt)`` cell. |
 ## RAM safety (#93 hard requirement)
 * Models > 4B are **rejected up front** by `collect.py --max-model-b 4.0`.
 * Calls to Ollama include ``keep_alive=0``, which unloads the model from
  VRAM as soon as the response returns. We never hold two models' weights
  concurrently.
 * No mock/embedded judges hold weights either: the human judge is the
  Claude Code session, so its local RAM cost is zero.
 The pipeline can run on a 15 GiB RAM / 8 GiB VRAM box (1070-class GPU) end
 to end without paging.
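A minimal sketch of the up-front size check, assuming the parameter count is read from the Ollama tag suffix (`:3b`, `:1.5b`) the way `collect.py` does. The names `SIZE_TAG` and `model_too_big` are illustrative, not the harness's public API.

```python
import re

# Matches Ollama-style size tags such as ":3b" or ":1.5b".
SIZE_TAG = re.compile(r":(\d+(?:\.\d+)?)b\b", re.IGNORECASE)


def model_too_big(model: str, max_b: float = 4.0) -> bool:
    """Return True when the tag advertises more than max_b billion params."""
    match = SIZE_TAG.search(model)
    if match is None:
        return False  # no size tag (e.g. an embedding model): allowed
    return float(match.group(1)) > max_b


for name in ["llama3.2:3b", "qwen2.5:7b", "nomic-embed-text"]:
    print(name, model_too_big(name))
```

Note that a tag without a recognizable size suffix passes the check, so the cap is a guard against obvious mistakes rather than a hard guarantee.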
 ## Quick start
 ```bash
 # 1. Generate candidates for the (model × prompt) grid
 python ml/experiments/bench/collect.py \
    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
    --prompts v1,v2-mentor,v3-few-shot \
    --experiment tip-bench-2026-04-27 \
    --n-tips 5 \
    --diversity
 # 2. Export pending runs for Claude Code to score
 python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --export /tmp/oo-bench-judge.json
 # 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)
 # 4. Push scores back to MLflow
 python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --apply /tmp/oo-bench-judge.json
 # 5. Leaderboard
 python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
 ```
 ## Why the rubric matters
 Different judging sessions need to be comparable. `rubric.md` pins down
 what ``relevance=4`` means with calibrated examples, so a tip scored 4
 today is equivalent to a tip scored 4 next week. Without the rubric, the
 "lazy human-in-the-loop" judge drifts.
 ## Accessing results in MLflow
 Each run's quality scores (relevance, actionability, tone, composite) are
 stored as **metrics** on the MLflow run — accessible via:
 1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
 2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
 3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk
 Candidate tips, prompts, and raw responses are stored as **tags** with
 keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt`
 (tag fallback because the MLflow server uses a file:// artifact backend
 not accessible via REST from the host).
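Because the artifacts live in tags, reading one back amounts to a tag lookup plus a JSON decode. The sketch below assumes a run in the shape returned by the MLflow REST API (`data.tags` as a list of key/value pairs). The helper name `artifact_from_tags`, the sample run, and the `tip` field are hypothetical.

```python
import json
from typing import Optional


def artifact_from_tags(run: dict, name: str) -> Optional[str]:
    """Return the text stored under the ``artifact:<name>`` tag, if present."""
    for tag in run.get("data", {}).get("tags", []):
        if tag.get("key") == f"artifact:{name}":
            return tag.get("value")
    return None


# Hypothetical run payload for illustration only.
sample_run = {
    "data": {
        "tags": [
            {"key": "judge_pending", "value": "true"},
            {
                "key": "artifact:candidates.json",
                "value": '[{"tip": "Stretch for five minutes"}]',
            },
        ]
    }
}

raw = artifact_from_tags(sample_run, "candidates.json")
candidates = json.loads(raw) if raw is not None else []
print(candidates)
```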
 ## Running standalone
 The pipeline runs on any machine with:
 - Ollama models ≤4B
 - MLflow tracking server
 - Python 3.10+
--- a/ml/experiments/bench/__init__.py
+++ b/ml/experiments/bench/__init__.py
@@ -0,0 +1,18 @@
 """oO tip-generation benchmark harness.
 Combines model evaluation (#93) and prompt A/B testing (#95) into one
 MLflow-tracked experiment. Each evaluation cell is one (model × prompt ×
 scenario) triple; we vary models and prompts on the same fixed scenario
 set so quality differences are attributable rather than confounded.
 The pipeline follows the lazy-judge pattern: collect candidates with
 deterministic metrics (latency, format_ok), export to a JSON file for
 Claude Code to score per the rubric, apply scores back to MLflow, and
 generate a leaderboard.
 RAM safety is enforced: models >4B are rejected, Ollama calls use
 keep_alive=0 to unload VRAM immediately, and the human judge (Claude Code
 session) has zero inference cost.
 See README.md for usage.
 """
--- a/ml/experiments/bench/collect.py
+++ b/ml/experiments/bench/collect.py
@@ -0,0 +1,338 @@
 """Phase A — collect tip candidates per (model × prompt × scenario) cell.
 Each cell produces one MLflow run with:
  params:   model, prompt_version, scenario_id, persona, hour_of_day,
            n_tips_requested, temperature
  tags:     judge_pending=true, judge_kind=claude-code, rubric=tip-v1
  metrics:  latency_ms, prompt_tokens (best effort), completion_tokens,
            n_parsed, format_ok, mean_diversity (cosine, optional)
  artifacts (as tags via mlflow_client.log_text):
            prompt.txt          system + user prompt as sent
            candidates.json     parsed candidate array
            raw.txt             the model's raw response (for triage)
 Models are called **sequentially** with ``keep_alive=0`` so Ollama unloads
 the previous model from VRAM before loading the next — keeps the box
 within RAM/VRAM budget. Models > 4B are rejected up front.
 Usage:
    python collect.py \\
        --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \\
        --prompts v1,v2-mentor,v3-few-shot \\
        --n-tips 5 \\
        --experiment tip-bench-2026-04-27
 """
 from __future__ import annotations
 import argparse
 import json
 import math
 import os
 import re
 import sys
 import time
 from dataclasses import asdict
 from pathlib import Path
 import httpx
 _BENCH = Path(__file__).resolve().parent
 _ML = _BENCH.parent.parent
 sys.path.insert(0, str(_BENCH))
 sys.path.insert(0, str(_BENCH.parent / "sim"))
 sys.path.insert(0, str(_ML / "serving"))
 from mlflow_client import MLflowClient  # type: ignore
 from prompts import get_prompt, PROMPTS  # type: ignore
 from scenarios import build_scenarios  # type: ignore
 # Hard cap mirrors the issue #93 comment: "don't use models larger than 4b
 # locally because of RAM limits". A regex cheap-match on the tag handles
 # the common ``name:Nb`` and ``name:N.Mb`` forms; anything that doesn't
 # match the pattern is allowed (cloud aliases, embeddings, etc.).
 _SIZE_TAG = re.compile(r":(\d+(?:\.\d+)?)b\b", re.IGNORECASE)
 def _model_too_big(model: str, max_b: float = 4.0) -> bool:
    m = _SIZE_TAG.search(model)
    if not m:
        return False
    return float(m.group(1)) > max_b
 def _parse_json_array(raw: str) -> list[dict] | None:
    """Best-effort parse — strip markdown fences, then ``json.loads``."""
    text = raw.strip()
    if text.startswith("```"):
        parts = text.split("```")
        text = parts[1] if len(parts) > 1 else text
        if text.lstrip().lower().startswith("json"):
            text = text.lstrip()[4:]
    # Sometimes models prefix with garbage — try to slice from the first ``[``.
    if not text.lstrip().startswith("["):
        i = text.find("[")
        if i >= 0:
            text = text[i:]
    try:
        v = json.loads(text)
        return v if isinstance(v, list) else None
    except (json.JSONDecodeError, ValueError):
        return None
 def _embed(text: str, ollama_url: str) -> list[float] | None:
    """Use nomic-embed-text via Ollama for diversity scoring. ~250MB,
    safe to load alongside any 4B chat model thanks to ``keep_alive=0``.
    """
    try:
        with httpx.Client(trust_env=False, timeout=30.0) as c:
            r = c.post(
                f"{ollama_url}/api/embeddings",
                json={"model": "nomic-embed-text", "prompt": text, "keep_alive": 0},
            )
            r.raise_for_status()
            return r.json().get("embedding")
    except Exception:
        return None
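 def _mean_pairwise_cosine(vecs: list[list[float]]) -> float:
    """Mean pairwise cosine similarity; lower values mean more diverse tips.
    Returns 0.0 when fewer than two vectors are given.
    """
    if len(vecs) < 2:
        return 0.0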
    def cos(a: list[float], b: list[float]) -> float:
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        if na == 0 or nb == 0:
            return 0.0
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    n = len(vecs)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cos(vecs[i], vecs[j])
            count += 1
    return total / count if count else 0.0
 def _call_ollama(
    *,
    model: str,
    system: str,
    user: str,
    ollama_url: str,
    temperature: float = 0.7,
 ) -> tuple[str, dict]:
    """Direct call to Ollama. Returns (raw_text, telemetry).
    ``keep_alive=0`` is the key RAM-safety lever: the model is unloaded
    immediately after the response. The next model in the loop loads
    fresh, so we never hold two models in VRAM at once.
    """
    t0 = time.perf_counter()
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
        "keep_alive": 0,
        "options": {"temperature": temperature},
    }
    with httpx.Client(trust_env=False, timeout=180.0) as c:
        r = c.post(f"{ollama_url}/api/chat", json=body)
        r.raise_for_status()
        data = r.json()
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    raw = data.get("message", {}).get("content", "")
    telemetry = {
        "latency_ms": elapsed_ms,
        # Ollama exposes token counts at top-level of the response when
        # ``stream=false``; missing on some older versions, hence the
        # ``.get`` defaults.
        "prompt_tokens": float(data.get("prompt_eval_count", 0) or 0),
        "completion_tokens": float(data.get("eval_count", 0) or 0),
    }
    return raw, telemetry
 def main() -> int:
    parser = argparse.ArgumentParser(description="oO tip-generation benchmark — Phase A")
    parser.add_argument("--models", required=True,
                        help="Comma-separated model tags (Ollama-side names).")
    parser.add_argument("--prompts", default=",".join(PROMPTS.keys()),
                        help="Comma-separated prompt versions from ml/serving/prompts.py.")
    parser.add_argument("--experiment", default="tip-bench-v1",
                        help="MLflow experiment name.")
    parser.add_argument("--n-tips", type=int, default=5,
                        help="Tips to request per scenario.")
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--ollama-url", default=os.environ.get("OLLAMA_URL", "http://localhost:11434"))
    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"))
    parser.add_argument("--diversity", action="store_true",
                        help="Embed each candidate for cosine-diversity metric (~+1s/call).")
    parser.add_argument("--max-model-b", type=float, default=4.0,
                        help="Reject models tagged larger than this many billion params.")
    parser.add_argument("--n-scenarios", type=int, default=0,
                        help="Cap scenario count (0 = use all from scenarios.py).")
    parser.add_argument("--rubric", default=str(_BENCH / "rubric.md"),
                        help="Rubric file logged once per experiment.")
    args = parser.parse_args()
    models = [m.strip() for m in args.models.split(",") if m.strip()]
    prompts = [p.strip() for p in args.prompts.split(",") if p.strip()]
    too_big = [m for m in models if _model_too_big(m, args.max_model_b)]
    if too_big:
        print(f"ERROR: models exceed --max-model-b={args.max_model_b}: {too_big}", file=sys.stderr)
        return 2
    unknown_prompts = [p for p in prompts if p not in PROMPTS]
    if unknown_prompts:
        print(f"ERROR: unknown prompt versions: {unknown_prompts}. "
              f"Available: {list(PROMPTS)}", file=sys.stderr)
        return 2
    scenarios = build_scenarios()
    if args.n_scenarios and args.n_scenarios < len(scenarios):
        scenarios = scenarios[:args.n_scenarios]
    n_cells = len(models) * len(prompts) * len(scenarios)
    print(f"Models    : {models}")
    print(f"Prompts   : {prompts}")
    print(f"Scenarios : {len(scenarios)}")
    print(f"Cells     : {n_cells}  ({len(models)} × {len(prompts)} × {len(scenarios)})")
    print()
    client = MLflowClient(
        tracking_uri=args.mlflow_url,
        username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
        password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
    )
    exp_id = client.get_or_create_experiment(args.experiment)
    print(f"MLflow experiment: {args.experiment}  (id={exp_id})")
    rubric_text = Path(args.rubric).read_text(encoding="utf-8")
    # The outer loop is over *model* so that all cells for one model run
    # back to back. With ``keep_alive=0`` Ollama may nominally reload the
    # model for every (model × prompt × scenario) cell, but in practice it
    # keeps a recently used model resident across a burst of requests, so
    # each model stays warm throughout its sub-loop.
    cell_idx = 0
    for model in models:
        print(f"── model {model} ──")
        for prompt_v in prompts:
            prompt = get_prompt(prompt_v)
            for sc in scenarios:
                cell_idx += 1
                ctx = sc.to_prompt_context()
                class _Ctx:
                    pass
                _ctx = _Ctx()
                _ctx.tasks = ctx["tasks"]
                _ctx.hour_of_day = ctx["hour_of_day"]
                _ctx.day_of_week = ctx["day_of_week"]
                _ctx.extra = ctx["extra"]
                user_msg = prompt.build_user(_ctx, args.n_tips)
                run_id = client.create_run(
                    exp_id,
                    run_name=f"{model}__{prompt_v}__{sc.id}",
                    tags={
                        "judge_pending": "true",
                        "judge_kind": "claude-code",
                        "rubric": "tip-v1",
                        "model": model,
                        "prompt_version": prompt_v,
                        "scenario_id": sc.id,
                        "persona": sc.persona.name,
                    },
                )
                client.log_params(run_id, {
                    "model": model,
                    "prompt_version": prompt_v,
                    "scenario_id": sc.id,
                    "persona": sc.persona.name,
                    "hour_of_day": sc.hour_of_day,
                    "day_of_week": sc.day_of_week,
                    "n_tips_requested": args.n_tips,
                    "temperature": args.temperature,
                })
                try:
                    raw, telemetry = _call_ollama(
                        model=model,
                        system=prompt.system,
                        user=user_msg,
                        ollama_url=args.ollama_url,
                        temperature=args.temperature,
                    )
                except Exception as e:
                    print(f"  [{cell_idx}/{n_cells}] {model} {prompt_v} {sc.id}: ERROR {e}")
                    client.set_tag(run_id, "error", str(e)[:500])
                    client.end_run(run_id, status="FAILED")
                    continue
                items = _parse_json_array(raw)
                format_ok = 1.0 if items is not None else 0.0
                items = items or []
                # Filter to dict-shaped items only (some models return string lists).
                cand_dicts = [
                    {
                        "id": str(it.get("id", f"tip-{i}")),
                        "content": str(it.get("content", "")),
                        "rationale": str(it.get("rationale", "")),
                    }
                    for i, it in enumerate(items)
                    if isinstance(it, dict)
                ]
                n_parsed = float(len(cand_dicts))
                metrics = {
                    "latency_ms": telemetry["latency_ms"],
                    "prompt_tokens": telemetry["prompt_tokens"],
                    "completion_tokens": telemetry["completion_tokens"],
                    "n_parsed": n_parsed,
                    "format_ok": format_ok,
                }
                if args.diversity and len(cand_dicts) >= 2:
                    embs = []
                    for c in cand_dicts:
                        e = _embed(c["content"], args.ollama_url)
                        if e:
                            embs.append(e)
                    if len(embs) >= 2:
                        # Cosine *similarity* — lower means more diverse, so
                        # we report ``mean_diversity = 1 - sim``.
                        sim = _mean_pairwise_cosine(embs)
                        metrics["mean_diversity"] = 1.0 - sim
                client.log_metrics(run_id, metrics)
                client.log_text(run_id, prompt.system + "\n\n---\n\n" + user_msg, "prompt.txt")
                client.log_text(run_id, json.dumps(cand_dicts, indent=2), "candidates.json")
                client.log_text(run_id, raw[:9_000], "raw.txt")
                # Stamp the rubric (truncated to the tag-value limit) onto every
                # run as a tag. Cheap, and it makes every run self-describing.
                client.set_tag(run_id, "rubric_md", rubric_text[: client._TAG_VALUE_LIMIT])
                client.end_run(run_id)
                print(f"  [{cell_idx:>3}/{n_cells}] {model:18s} {prompt_v:12s} {sc.id:24s}  "
                      f"lat={metrics['latency_ms']:>6.0f}ms  parsed={int(n_parsed)}/{args.n_tips}  "
                      f"fmt={int(format_ok)}")
    print()
    print("Phase A complete. Run judge_cli.py --export to score pending runs.")
    print(f"  python ml/experiments/bench/judge_cli.py --experiment {args.experiment} \\")
    print("      --export /tmp/oo-bench-judge-requests.json")
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/ml/experiments/bench/compare.py
+++ b/ml/experiments/bench/compare.py
@@ -0,0 +1,144 @@
 """Phase C — leaderboard from judged MLflow runs.
 Pulls every finished run that carries the ``composite`` metric (written
 by ``judge_cli.py --apply``) from the experiment, groups by (model, prompt)
 cell, and prints a leaderboard sorted by mean composite score.
 Also reports the deterministic-only metrics (latency, format_ok) so
 cells with great prose but broken JSON are visible.
 """
 from __future__ import annotations
 import argparse
 import os
 import statistics
 import sys
 from collections import defaultdict
 from pathlib import Path
 _BENCH = Path(__file__).resolve().parent
 sys.path.insert(0, str(_BENCH))
 from mlflow_client import MLflowClient  # type: ignore
 def _params(run: dict) -> dict[str, str]:
    return {p["key"]: p["value"] for p in run["data"].get("params", [])}
 def _metrics(run: dict) -> dict[str, float]:
    return {m["key"]: m["value"] for m in run["data"].get("metrics", [])}
 def _tags(run: dict) -> dict[str, str]:
    return {t["key"]: t["value"] for t in run["data"].get("tags", [])}
 def main() -> int:
    parser = argparse.ArgumentParser(description="oO bench — Phase C (leaderboard)")
    parser.add_argument("--experiment", required=True)
    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"))
    parser.add_argument("--include-pending", action="store_true",
                        help="Also include rows with no quality scores (latency/format only).")
    args = parser.parse_args()
    client = MLflowClient(
        tracking_uri=args.mlflow_url,
        username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
        password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
    )
    exp_id = client.get_or_create_experiment(args.experiment)
    runs = client.search_runs(exp_id, max_results=2000)
    # Group key = (model, prompt_version)
    cells: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for r in runs:
        params = _params(r)
        metrics = _metrics(r)
        tags = _tags(r)
        if r["info"].get("status") != "FINISHED":
            continue
        if not args.include_pending and "composite" not in metrics:
            continue
        cells[(params.get("model", "?"), params.get("prompt_version", "?"))].append({
            "metrics": metrics,
            "scenario": params.get("scenario_id", "?"),
            "judged": tags.get("judge_pending") == "false",
        })
    if not cells:
        print("No judged runs found. Did you run judge_cli.py --apply?")
        return 1
    rows = []
    for (model, prompt), records in cells.items():
        n = len(records)
        comp = [r["metrics"]["composite"] for r in records if "composite" in r["metrics"]]
        rel  = [r["metrics"]["relevance"] for r in records if "relevance" in r["metrics"]]
        act  = [r["metrics"]["actionability"] for r in records if "actionability" in r["metrics"]]
        tone = [r["metrics"]["tone"] for r in records if "tone" in r["metrics"]]
        lat  = [r["metrics"]["latency_ms"] for r in records if "latency_ms" in r["metrics"]]
        fmt  = [r["metrics"]["format_ok"] for r in records if "format_ok" in r["metrics"]]
        div  = [r["metrics"]["mean_diversity"] for r in records if "mean_diversity" in r["metrics"]]
        rows.append({
            "model": model,
            "prompt": prompt,
            "n": n,
            "composite": statistics.mean(comp) if comp else None,
            "relevance": statistics.mean(rel) if rel else None,
            "actionability": statistics.mean(act) if act else None,
            "tone": statistics.mean(tone) if tone else None,
            "format_ok": statistics.mean(fmt) if fmt else None,
            "latency_p50": statistics.median(lat) if lat else None,
            "latency_p95": _p95(lat) if lat else None,
            "diversity": statistics.mean(div) if div else None,
        })
    rows.sort(key=lambda r: r["composite"] if r["composite"] is not None else -1, reverse=True)
    # Width-fitted printer — keeps output legible in a 100-col terminal.
    print()
    print(f"Experiment: {args.experiment}  (id={exp_id})")
    print(f"Cells     : {len(rows)}")
    print()
    header = (
        f"{'#':>2}  {'model':18s} {'prompt':12s} {'n':>3s}  "
        f"{'comp':>5s} {'rel':>4s} {'act':>4s} {'tone':>4s} "
        f"{'fmt':>4s} {'p50':>6s} {'p95':>6s} {'div':>5s}"
    )
    print(header)
    print("─" * len(header))
    for i, r in enumerate(rows, 1):
        comp = f"{r['composite']:.2f}" if r["composite"] is not None else "  -- "
        rel  = f"{r['relevance']:.1f}" if r["relevance"] is not None else " -- "
        act  = f"{r['actionability']:.1f}" if r["actionability"] is not None else " -- "
        tone = f"{r['tone']:.1f}" if r["tone"] is not None else " -- "
        fmt  = f"{r['format_ok']:.2f}" if r["format_ok"] is not None else " -- "
        p50  = f"{r['latency_p50']:.0f}" if r["latency_p50"] is not None else "  --  "
        p95  = f"{r['latency_p95']:.0f}" if r["latency_p95"] is not None else "  --  "
        div  = f"{r['diversity']:.2f}" if r["diversity"] is not None else " -- "
        print(
            f"{i:>2}  {r['model']:18s} {r['prompt']:12s} {r['n']:>3d}  "
            f"{comp:>5s} {rel:>4s} {act:>4s} {tone:>4s} "
            f"{fmt:>4s} {p50:>6s} {p95:>6s} {div:>5s}"
        )
    if rows[0]["composite"] is not None:
        winner = rows[0]
        print()
        print(f"Winner: {winner['model']} × {winner['prompt']}  "
              f"(composite={winner['composite']:.2f}, n={winner['n']})")
    return 0
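 def _p95(xs: list[float]) -> float:
    """95th percentile by rounding to the nearest sample index (no interpolation)."""
    if not xs:
        return 0.0
    s = sorted(xs)
    # 0.95 * (len - 1) is never negative, so no lower clamp is needed.
    idx = int(round(0.95 * (len(s) - 1)))
    return s[idx]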
 if __name__ == "__main__":
    sys.exit(main())
--- a/ml/experiments/bench/judge_cli.py
+++ b/ml/experiments/bench/judge_cli.py
@@ -0,0 +1,191 @@
 """Phase B — Claude Code as the lazy MLflow judge.
 Two sub-commands, both keyed to MLflow tags so the same run cycles
 through ``judge_pending=true`` → judged → ``judge_pending=false`` exactly
 once.
  --export PATH
      Pull every run with ``judge_pending=true`` and ``judge_kind=claude-code``
      from the experiment, bundle the prompt + parsed candidates + the
      rubric into a single JSON file the Claude Code session can read.
  --apply PATH
      Read the responses (same shape as the request, with ``scores`` filled in)
      and log ``relevance``, ``actionability``, ``tone``, ``overlong`` and the
      derived ``composite`` as MLflow metrics on the corresponding runs. Sets
      ``judge_pending=false`` and stamps ``judged_at`` / ``judged_by`` so the
      run won't be picked up twice.
 The request file is intentionally one big JSON document, so the judge
 sees the full set in one place and can score consistently.
 """
 from __future__ import annotations
 import argparse
 import json
 import os
 import sys
 import time
 from pathlib import Path
 _BENCH = Path(__file__).resolve().parent
 sys.path.insert(0, str(_BENCH))
 from mlflow_client import MLflowClient  # type: ignore
 _DIMENSIONS = ("relevance", "actionability", "tone")
 _BIN_FLAGS = ("overlong",)
 def _tags_dict(run: dict) -> dict[str, str]:
    return {t["key"]: t["value"] for t in run.get("data", {}).get("tags", [])}
 def _params_dict(run: dict) -> dict[str, str]:
    return {p["key"]: p["value"] for p in run.get("data", {}).get("params", [])}
 def export(client: MLflowClient, experiment: str, out_path: str) -> int:
    exp_id = client.get_or_create_experiment(experiment)
    runs = client.search_runs(
        exp_id,
        filter_string="tags.judge_pending = 'true' and tags.judge_kind = 'claude-code'",
    )
    if not runs:
        print("No pending runs.")
        Path(out_path).write_text(json.dumps({
            "experiment": experiment,
            "exported_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "rubric": "tip-v1",
            "items": [],
        }, indent=2))
        return 0
    rubric_text = (_BENCH / "rubric.md").read_text(encoding="utf-8")
    items: list[dict] = []
    for run in runs:
        run_id = run["info"]["run_id"]
        tags = _tags_dict(run)
        params = _params_dict(run)
        candidates_json = client.get_artifact_text(run_id, "candidates.json")
        prompt_text = client.get_artifact_text(run_id, "prompt.txt")
        try:
            candidates = json.loads(candidates_json) if candidates_json else []
        except json.JSONDecodeError:
            candidates = []
        items.append({
            "run_id": run_id,
            "model": params.get("model") or tags.get("model"),
            "prompt_version": params.get("prompt_version") or tags.get("prompt_version"),
            "scenario_id": params.get("scenario_id") or tags.get("scenario_id"),
            "persona": params.get("persona") or tags.get("persona"),
            "hour_of_day": int(params.get("hour_of_day", "12")),
            "day_of_week": int(params.get("day_of_week", "0")),
            "prompt": prompt_text,
            "candidates": candidates,
            # Per-run scoring slot — judge fills these in.
            "scores": {
                "relevance": None,        # 1–5, integer
                "actionability": None,    # 1–5, integer
                "tone": None,             # 1–5, integer
                "overlong": None,         # 0/1
                "notes": "",              # short comment, optional
            },
        })
    out = {
        "experiment": experiment,
        "exported_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "rubric": "tip-v1",
        "rubric_md": rubric_text,
        "items": items,
    }
    Path(out_path).write_text(json.dumps(out, indent=2, ensure_ascii=False))
    print(f"Exported {len(items)} pending runs → {out_path}")
    return 0
 def apply(client: MLflowClient, experiment: str, in_path: str) -> int:
    exp_id = client.get_or_create_experiment(experiment)
    payload = json.loads(Path(in_path).read_text(encoding="utf-8"))
    items = payload.get("items", [])
    if not items:
        print("No items in response file.")
        return 0
    judged_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    n_applied, n_skipped = 0, 0
    for item in items:
        run_id = item["run_id"]
        scores = item.get("scores") or {}
        missing = [d for d in _DIMENSIONS if scores.get(d) in (None, "")]
        if missing:
            print(f"  [skip] {run_id}: missing {missing}")
            n_skipped += 1
            continue
        metrics = {d: float(scores[d]) for d in _DIMENSIONS}
        for f in _BIN_FLAGS:
            v = scores.get(f)
            if v not in (None, ""):
                metrics[f] = float(int(bool(int(v))))
        # Composite mirrors rubric.md: relevance + actionability + tone
        # + 2 * format_ok - overlong.  format_ok is already a metric on
        # the run from collect.py; re-fetching is cheap and keeps this
        # script idempotent if format compliance was retroactively fixed.
        run = client._get("/runs/get", {"run_id": run_id})["run"]
        existing_metrics = {m["key"]: m["value"] for m in run["data"].get("metrics", [])}
        format_ok = float(existing_metrics.get("format_ok", 0.0))
        overlong = metrics.get("overlong", 0.0)
        composite = (
            metrics["relevance"] + metrics["actionability"] + metrics["tone"]
            + 2 * format_ok - overlong
        )
        metrics["composite"] = composite
        client.log_metrics(run_id, metrics)
        client.set_tags(run_id, {
            "judge_pending": "false",
            "judged_at": judged_at,
            "judged_by": "claude-code-session",
        })
        if scores.get("notes"):
            client.set_tag(run_id, "judge_notes", str(scores["notes"])[:1000])
        n_applied += 1
        print(f"  [ok]   {run_id}: rel={metrics['relevance']:.1f} "
              f"act={metrics['actionability']:.1f} tone={metrics['tone']:.1f} "
              f"comp={composite:.2f}")
    print(f"Applied {n_applied}, skipped {n_skipped}.")
    return 0
 def main() -> int:
    parser = argparse.ArgumentParser(description="oO bench — Phase B (Claude Code judge)")
    parser.add_argument("--experiment", required=True)
    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"))
    grp = parser.add_mutually_exclusive_group(required=True)
    grp.add_argument("--export", metavar="PATH",
                     help="Write pending runs as a judgment-request JSON file.")
    grp.add_argument("--apply", metavar="PATH",
                     help="Read filled-in responses and write metrics back to MLflow.")
    args = parser.parse_args()
    client = MLflowClient(
        tracking_uri=args.mlflow_url,
        username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
        password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
    )
    if args.export:
        return export(client, args.experiment, args.export)
    return apply(client, args.experiment, args.apply)
 if __name__ == "__main__":
    sys.exit(main())
--- a/ml/experiments/bench/mlflow_client.py
+++ b/ml/experiments/bench/mlflow_client.py
@@ -0,0 +1,201 @@
 """Thin MLflow REST wrapper.
 Why not the official ``mlflow`` SDK? Two reasons specific to the oO setup:
 1. The MLflow server (3.11) ships with ``--allowed-hosts localhost`` but
   curl / requests / urllib3 send ``Host: localhost:5000`` — the port
   suffix fails the DNS-rebinding check. We override the Host header per
   request, which the SDK doesn't expose.
 2. The collect/judge phases only need ~6 endpoints (create/search/log).
   Pulling a 200MB SDK transitively for that is excess weight.
 All calls are synchronous httpx with explicit ``Host`` so the script can
 run from the host shell or from inside docker without further config.
 """
 from __future__ import annotations
 import os
 import time
 from dataclasses import dataclass
 from typing import Any
 import httpx
 def _strip_path(uri: str) -> tuple[str, str]:
    """Return (origin, path_prefix) — handles both /mlflow and / roots.
    ``http://mlflow:5000/mlflow``  → ("http://mlflow:5000", "/mlflow")
    ``http://localhost:5000``      → ("http://localhost:5000", "")
    """
    uri = uri.rstrip("/")
    if "/" not in uri.split("://", 1)[1]:
        return uri, ""
    scheme, _, rest = uri.partition("://")
    host, _, path = rest.partition("/")
    return f"{scheme}://{host}", ("/" + path if path else "")
@dataclass
 class MLflowClient:
    tracking_uri: str
    username: str | None = None
    password: str | None = None
    host_header: str | None = None  # override for DNS-rebinding sidestep
    timeout: float = 30.0
    def __post_init__(self) -> None:
        self._origin, self._ui_prefix = _strip_path(self.tracking_uri)
        # MLflow 3.x exposes the REST API at the root, *not* under the
        # ``/mlflow`` UI prefix. Empirically verified against the running
        # ghcr.io/mlflow/mlflow:v3.11.1 container.
        self._api = f"{self._origin}/api/2.0/mlflow"
        self._auth = (self.username, self.password) if self.username else None
        # If user did not pass a host header, derive from origin. Strip
        # the port if present — the server's allowed-hosts check rejects
        # ``localhost:5000`` even when ``localhost`` is allowed.
        if self.host_header is None:
            host = self._origin.split("://", 1)[1]
            self.host_header = host.split(":", 1)[0]
    @classmethod
    def from_env(cls) -> "MLflowClient":
        return cls(
            tracking_uri=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
            username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
            password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
            host_header=os.environ.get("MLFLOW_HOST_HEADER"),
        )
    def _headers(self) -> dict[str, str]:
        return {"Host": self.host_header or "localhost"}
    def _post(self, path: str, body: dict) -> dict:
        with httpx.Client(trust_env=False, timeout=self.timeout) as c:
            r = c.post(f"{self._api}{path}", json=body, headers=self._headers(), auth=self._auth)
            r.raise_for_status()
            return r.json()
    def _get(self, path: str, params: dict | None = None) -> dict:
        with httpx.Client(trust_env=False, timeout=self.timeout) as c:
            r = c.get(f"{self._api}{path}", params=params or {}, headers=self._headers(), auth=self._auth)
            r.raise_for_status()
            return r.json()
    # ── Experiments ────────────────────────────────────────────────────
    def get_or_create_experiment(self, name: str) -> str:
        try:
            r = self._get("/experiments/get-by-name", {"experiment_name": name})
            return r["experiment"]["experiment_id"]
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in (404, 400):
                raise
        r = self._post("/experiments/create", {"name": name})
        return r["experiment_id"]
    # ── Runs ───────────────────────────────────────────────────────────
    def create_run(
        self,
        experiment_id: str,
        run_name: str,
        tags: dict[str, str] | None = None,
    ) -> str:
        body: dict[str, Any] = {
            "experiment_id": experiment_id,
            "start_time": int(time.time() * 1000),
            "run_name": run_name,
            "tags": [
                {"key": k, "value": str(v)}
                for k, v in (tags or {}).items()
            ],
        }
        r = self._post("/runs/create", body)
        return r["run"]["info"]["run_id"]
    def log_param(self, run_id: str, key: str, value: Any) -> None:
        self._post("/runs/log-parameter", {"run_id": run_id, "key": key, "value": str(value)})
    def log_params(self, run_id: str, params: dict[str, Any]) -> None:
        for k, v in params.items():
            self.log_param(run_id, k, v)
    def log_metric(self, run_id: str, key: str, value: float, step: int = 0) -> None:
        self._post("/runs/log-metric", {
            "run_id": run_id,
            "key": key,
            "value": float(value),
            "timestamp": int(time.time() * 1000),
            "step": step,
        })
    def log_metrics(self, run_id: str, metrics: dict[str, float]) -> None:
        for k, v in metrics.items():
            self.log_metric(run_id, k, v)
    def set_tag(self, run_id: str, key: str, value: str) -> None:
        self._post("/runs/set-tag", {"run_id": run_id, "key": key, "value": str(value)})
    def set_tags(self, run_id: str, tags: dict[str, str]) -> None:
        for k, v in tags.items():
            self.set_tag(run_id, k, v)
    # MLflow tag values are capped at 5000 chars by the server (longer
    # values are rejected with INVALID_PARAMETER_VALUE). 4500 leaves
    # headroom for internal metadata MLflow may append on its own.
    _TAG_VALUE_LIMIT = 4500
    def log_text(self, run_id: str, text: str, artifact_path: str) -> None:
        """Persist short text alongside the run.
        The MLflow server in this deployment uses a ``file://`` artifact
        backend, which is only reachable from inside the container — not
        via the REST proxy. We instead stash short payloads as tags
        keyed ``artifact:<path>``. Anything longer than 4500 chars is
        chunked into ``artifact:<path>:0``, ``:1`` …; ``get_artifact_text``
        re-stitches them in order.
        """
        key_base = f"artifact:{artifact_path}"
        if len(text) <= self._TAG_VALUE_LIMIT:
            self.set_tag(run_id, key_base, text)
            return
        # chunk
        for i in range(0, len(text), self._TAG_VALUE_LIMIT):
            self.set_tag(run_id, f"{key_base}:{i // self._TAG_VALUE_LIMIT}",
                          text[i:i + self._TAG_VALUE_LIMIT])
    def get_artifact_text(self, run_id: str, artifact_path: str) -> str:
        run = self._get("/runs/get", {"run_id": run_id})["run"]
        tags = {t["key"]: t["value"] for t in run["data"].get("tags", [])}
        key_base = f"artifact:{artifact_path}"
        if key_base in tags:
            return tags[key_base]
        # chunked form
        chunks = sorted(
            (k for k in tags if k.startswith(f"{key_base}:")),
            key=lambda k: int(k.rsplit(":", 1)[1]),
        )
        return "".join(tags[k] for k in chunks)
    def end_run(self, run_id: str, status: str = "FINISHED") -> None:
        self._post("/runs/update", {
            "run_id": run_id,
            "status": status,
            "end_time": int(time.time() * 1000),
        })
    def search_runs(
        self,
        experiment_id: str,
        filter_string: str = "",
        max_results: int = 1000,
    ) -> list[dict]:
        body = {
            "experiment_ids": [experiment_id],
            "filter": filter_string,
            "max_results": max_results,
        }
        r = self._post("/runs/search", body)
        return r.get("runs", [])
--- a/ml/experiments/bench/rubric.md
+++ b/ml/experiments/bench/rubric.md
@@ -0,0 +1,85 @@
 # Tip-quality rubric — `tip-v1`
 This file is the consistency anchor for the Claude Code judge. The same
 rubric is used across every judging session so verdicts are comparable
 across runs (per the lazy-judge pattern in #95).
 Each candidate tip is scored on three independent 1–5 dimensions, plus
 two binary flags. Score the **content of the tip itself** for the given
 persona/context — do not score the rationale.
 ## Dimensions
 ### relevance — 1 to 5
 How well does the tip respond to *this specific persona at this specific
 time*? A generic productivity platitude is 1; a tip that hooks into the
 persona's stated preferences and the actual hour-of-day is 5.
 | score | description |
 |-------|-------------|
 | 1 | Boilerplate. Could apply to any user, any time. |
 | 2 | Vaguely fits the persona but ignores context. |
 | 3 | Fits the persona OR the time, not both. |
 | 4 | Fits both persona and time, with one specific anchor (a task, an hour, a habit). |
 | 5 | Specific to the persona's preferences AND respects the hour, with a clear hook into a candidate task or routine. |
 ### actionability — 1 to 5
 Could the user *do this in the next 10 minutes* without further planning?
 "Try to focus more" is 1; "Spend 12 minutes on the Call dentist task and
 stop when the timer ends" is 5.
 | score | description |
 |-------|-------------|
 | 1 | Pure encouragement, no action. |
 | 2 | Action exists but vague ("review your tasks"). |
 | 3 | Concrete verb + object, but missing the time/duration handle. |
 | 4 | Concrete action with a duration or trigger ("for 10 minutes", "before lunch"). |
 | 5 | Micro-action with explicit start, duration, and a stop condition. |
 ### tone — 1 to 5
 Does the tip sound like a calm, specific mentor (the product voice) or
 like a generic chatbot/coach? Penalize emoji-spam, exclamation marks,
 hype words ("amazing!", "let's crush it!"), and corporate jargon.
 | score | description |
 |-------|-------------|
 | 1 | Hype, jargon, or motivational-poster tone. |
 | 2 | Polite chatbot tone, no warmth. |
 | 3 | Neutral, businesslike. |
 | 4 | Quiet and specific, like a coach who knows you. |
 | 5 | Earned. Reads like a mentor who has seen this exact stuck-pattern before. |
 ## Binary flags
 ### format_ok — 0 or 1
 1 if the *whole response* parsed as a JSON array of objects with the
 required keys (`id`, `content`, `rationale`). 0 otherwise. **This is
 computed automatically by `collect.py`** — judges should not override it.
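
A minimal sketch of the check described above, assuming the response is plain JSON text. The helper name `format_ok` and the treatment of an empty array as valid are assumptions for illustration; the parser in `collect.py` may be more lenient (for example about prose surrounding the array).

```python
import json

REQUIRED_KEYS = ("id", "content", "rationale")


def format_ok(raw: str) -> int:
    """Return 1 if raw is a JSON array of objects that all carry the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return 0
    if not isinstance(data, list):
        return 0
    # Every element must be an object with all three keys present.
    ok = all(
        isinstance(item, dict) and all(key in item for key in REQUIRED_KEYS)
        for item in data
    )
    return int(ok)
```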
 ### overlong — 0 or 1
 1 if `content` exceeds the documented 2-sentence cap (count sentence-
 ending punctuation `. ! ?`). Judges may flag this as a tiebreaker.
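
One way a judge could apply the counting rule mechanically is sketched below. The helper name `overlong` and the choice to treat a run such as `...` or `?!` as a single sentence ending are assumptions, not part of the rubric; decimals and abbreviations ("12.5", "e.g.") would be over-counted by this heuristic.

```python
import re

# A run of sentence-ending marks ("...", "?!") counts as one ending (assumption).
_SENTENCE_END = re.compile(r"[.!?]+")


def overlong(content: str, max_sentences: int = 2) -> int:
    """Return 1 if content has more sentence endings than the cap allows."""
    n_endings = len(_SENTENCE_END.findall(content))
    return int(n_endings > max_sentences)
```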
 ## Composite score
 `compare.py` ranks cells by:
 ```
 composite = relevance + actionability + tone + 2*format_ok - overlong
 ```
 i.e. format compliance carries double weight (malformed JSON is a hard
 production failure regardless of how good the prose is).
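
As a worked example of the formula above, here is a small sketch (the function name `composite` is illustrative). With three 1–5 dimensions and two 0/1 flags, the score ranges from 2 to 17.

```python
def composite(relevance: float, actionability: float, tone: float,
              format_ok: float, overlong: float = 0.0) -> float:
    """Composite score as defined in this rubric."""
    return relevance + actionability + tone + 2 * format_ok - overlong


# Best case: all dimensions at 5, valid JSON, not overlong.
best = composite(5, 5, 5, format_ok=1)                 # 17
# Worst case: all dimensions at 1, broken JSON, overlong.
worst = composite(1, 1, 1, format_ok=0, overlong=1)    # 2
# A tip scored relevance 5, actionability 5, tone 4 in a well-formed response.
typical = composite(5, 5, 4, format_ok=1)              # 16
```

Because `format_ok` is worth two points, a broken response loses as much as dropping two points on one of the judged dimensions.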
 ## Calibration examples
 (Shared with judges so a 4 means the same thing across sessions.)
 **Persona**: deadline-driven (responds to overdue/high-priority,
 morning-active). **Hour**: 09:00. **Tasks include**: an overdue
 "Call dentist", priority 4.
 - "Stay focused and make today count!" — relevance 1, actionability 1, tone 1.
 - "Review your tasks and pick one that matters." — relevance 2, actionability 2, tone 3.
 - "Spend the next 12 minutes on Call dentist — set a timer and stop when it rings." — relevance 5, actionability 5, tone 4.
 - "It's 09:00 — you respond to overdue items best now. Block 12 minutes for Call dentist before your first meeting." — relevance 5, actionability 5, tone 5.
--- a/ml/experiments/bench/scenarios.py
+++ b/ml/experiments/bench/scenarios.py
@@ -0,0 +1,80 @@
 """Fixed contexts for the tip-generation benchmark.
 Every cell of the (model × prompt) grid is evaluated on the *same* set of
 scenarios so quality differences are attributable to the model/prompt,
 not to context variance.
 A scenario is one (persona, hour-of-day, candidate-task-pool) tuple. The
 hours come from a fixed list of time slots, and each task pool is seeded
 deterministically from the persona's name and the hour so the bench is
 reproducible across machines.
 """
 from __future__ import annotations
 import sys
 from dataclasses import dataclass
 from pathlib import Path
 # Reuse personas from sim — same source of truth for user archetypes.
 sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "sim"))
 from personas import PERSONAS, Persona  # type: ignore
 from task_generator import generate_task_pool  # type: ignore
@dataclass(frozen=True)
 class Scenario:
    id: str           # stable id used as MLflow tag — keep ASCII safe
    persona: Persona
    hour_of_day: int  # 0–23
    day_of_week: int  # 0=Mon
    tasks: list[dict]
    def to_prompt_context(self) -> dict:
        """Shape expected by ml/serving/prompts.PromptContext."""
        return {
            "tasks": [
                {
                    "content": t["content"],
                    "priority": t["features"]["priority"],
                    "is_overdue": t["features"]["is_overdue"],
                    "due_date": t.get("due_date", "no due date"),
                }
                for t in self.tasks
            ],
            "hour_of_day": self.hour_of_day,
            "day_of_week": self.day_of_week,
            "extra": {
                "persona": self.persona.name,
                "persona_hint": self.persona.description,
            },
        }
 # Two time-slots probe whether the model adapts its tone to the hour.
 # Morning (09) and evening (21) are picked because most personas have
 # strong directional preferences there.
 _TIME_SLOTS = [(9, 1), (21, 3)]   # (hour_of_day, day_of_week)
 def build_scenarios(tasks_per_scenario: int = 6) -> list[Scenario]:
    """Return a deterministic list of scenarios.
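    With 4 personas × 2 time-slots = 8 scenarios. Task pools are seeded
    by ``crc32(persona.name) % 9973 + hour`` so runs are reproducible and
    each persona sees the same tasks at the same hour across cells. The
    built-in ``hash()`` is avoided because string hashes are salted per
    process (PYTHONHASHSEED), which would change the seed on every run.
    """
    import zlib  # local import: only needed for the stable seed below
    out: list[Scenario] = []
    for persona in PERSONAS[:4]:
        for hour, dow in _TIME_SLOTS:
            # crc32 is stable across processes and machines, unlike hash(str).
            seed = (zlib.crc32(persona.name.encode("utf-8")) % 9973) + hour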
            tasks = generate_task_pool(n=tasks_per_scenario, seed=seed)
            out.append(
                Scenario(
                    id=f"{persona.name}-h{hour:02d}",
                    persona=persona,
                    hour_of_day=hour,
                    day_of_week=dow,
                    tasks=tasks,
                )
            )
    return out
--- a/ml/experiments/sim/runner.py
+++ b/ml/experiments/sim/runner.py
@@ -26,6 +26,7 @@ from __future__ import annotations
 import argparse
 import json
 import os
 import random
 import sys
 import time
@@ -40,6 +41,12 @@ from llm_judge import ACTIONS, infer_reward, judge
 from personas import PERSONAS, Persona
 from task_generator import generate_task_pool
 try:
    import mlflow
    _MLFLOW_AVAILABLE = True
 except ImportError:
    _MLFLOW_AVAILABLE = False
 POLICY_SCORE_ENDPOINTS: dict[str, str] = {
    "linucb-v1": "/score",
    "egreedy-v1": "/score/egreedy",
@@ -107,14 +114,30 @@ def _call_reward(
 # ── Standard single-pass runner (rule / llm modes) ─────────────────────────
 def _init_mlflow(mlflow_url: str | None, experiment: str) -> str | None:
    """Set up MLflow tracking; return the sentinel "ready" on success, or None if unavailable."""
    if not _MLFLOW_AVAILABLE or not mlflow_url:
        return None
    try:
        mlflow.set_tracking_uri(mlflow_url)
        mlflow.set_experiment(experiment)
        return "ready"
    except Exception as e:
        print(f"  [warn] MLflow init failed: {e}", file=sys.stderr)
        return None
 def run_simulation(
    n_users: int, n_rounds: int, tasks_per_round: int,
    ml_url: str, policies: list[str], use_llm: bool, seed: int,
    mlflow_url: str | None = None, mlflow_experiment: str = "bandit_simulation",
 ) -> dict:
    rng = random.Random(seed)
    run_id = str(uuid.uuid4())[:8]
    started_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    _init_mlflow(mlflow_url, mlflow_experiment)
    user_personas = [
        (f"sim-{run_id}-u{i}", PERSONAS[i % len(PERSONAS)])
        for i in range(n_users)
@@ -130,62 +153,101 @@ def run_simulation(
    }
    events: list[dict] = []
-    with httpx.Client(trust_env=False) as client:
-        for rnd in range(n_rounds):
-            hour = rng.randint(6, 22)
-            dow = rng.randint(0, 6)
-            round_rewards = {p: 0.0 for p in policies}
-            for user_id, persona in user_personas:
-                seed_tasks = rnd * 997 + abs(hash(user_id)) % 997
-                tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks)
-                # Per-persona profile features for v2 (synthetic for sim — see ADR-0012)
-                profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None
-                for policy in policies:
-                    p_user = f"{user_id}-{policy}"
-                    scored = _call_score(client, ml_url, policy, p_user, tasks, hour, dow,
-                                         profile_features=profile)
-                    if not scored:
-                        continue
-                    tip_id = scored.get("tip_id")
-                    tip = next((t for t in tasks if t["id"] == tip_id), None)
-                    if not tip:
-                        continue
-                    action, dwell_ms, reward = judge(persona, tip, hour, dow, rng, use_llm=use_llm)
-                    _call_reward(client, ml_url, policy, p_user, tip_id, reward, {
-                        "hour_of_day": hour,
-                        "is_overdue": tip["features"]["is_overdue"],
-                        "task_age_days": tip["features"]["task_age_days"],
-                        "priority": tip["features"]["priority"],
-                    }, day_of_week=dow, profile_features=profile)
-                    acc[policy]["total_reward"] += reward
-                    acc[policy]["n_pulls"] += 1
-                    acc[policy]["action_counts"][action] += 1
-                    round_rewards[policy] += reward
-                    events.append({
-                        "round": rnd, "user_id": user_id, "persona": persona.name,
-                        "policy": policy, "tip_content": tip["content"],
-                        "priority": tip["features"]["priority"],
-                        "is_overdue": tip["features"]["is_overdue"],
-                        "action": action, "dwell_ms": dwell_ms, "reward": reward,
-                        "hour": hour, "day_of_week": dow,
-                    })
-            for p in policies:
-                prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0
-                acc[p]["cumulative_rewards"].append(prev + round_rewards[p])
-            mode = "llm" if use_llm else "rule"
-            print(f"  Round {rnd+1:>3}/{n_rounds} [{mode}]  " + "  ".join(
-                f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies
-            ))
-    return _build_result(run_id, started_at, policies, acc, events,
-                         n_users, n_rounds, tasks_per_round, use_llm, seed)
+    mlflow_run_id: str | None = None
+    mlflow_ctx = (
+        mlflow.start_run(run_name=run_id)
+        if (_MLFLOW_AVAILABLE and mlflow_url)
+        else None
+    )
+    try:
+        if mlflow_ctx:
+            active = mlflow_ctx.__enter__()
+            mlflow_run_id = active.info.run_id
+            mlflow.log_params({
+                "n_users": n_users,
+                "n_rounds": n_rounds,
+                "tasks_per_round": tasks_per_round,
+                "policies": ",".join(policies),
+                "judge": "llm" if use_llm else "rule",
+                "seed": seed,
+            })
+        with httpx.Client(trust_env=False) as client:
+            for rnd in range(n_rounds):
+                hour = rng.randint(6, 22)
+                dow = rng.randint(0, 6)
+                round_rewards = {p: 0.0 for p in policies}
+                for user_id, persona in user_personas:
+                    seed_tasks = rnd * 997 + abs(hash(user_id)) % 997
+                    tasks = generate_task_pool(n=tasks_per_round, seed=seed_tasks)
+                    profile = persona.profile_features(hour) if hasattr(persona, "profile_features") else None
+                    for policy in policies:
+                        p_user = f"{user_id}-{policy}"
+                        scored = _call_score(client, ml_url, policy, p_user, tasks, hour, dow,
+                                             profile_features=profile)
+                        if not scored:
+                            continue
+                        tip_id = scored.get("tip_id")
+                        tip = next((t for t in tasks if t["id"] == tip_id), None)
+                        if not tip:
+                            continue
+                        action, dwell_ms, reward = judge(persona, tip, hour, dow, rng, use_llm=use_llm)
+                        _call_reward(client, ml_url, policy, p_user, tip_id, reward, {
+                            "hour_of_day": hour,
+                            "is_overdue": tip["features"]["is_overdue"],
+                            "task_age_days": tip["features"]["task_age_days"],
+                            "priority": tip["features"]["priority"],
+                        }, day_of_week=dow, profile_features=profile)
+                        acc[policy]["total_reward"] += reward
+                        acc[policy]["n_pulls"] += 1
+                        acc[policy]["action_counts"][action] += 1
+                        round_rewards[policy] += reward
+                        events.append({
+                            "round": rnd, "user_id": user_id, "persona": persona.name,
+                            "policy": policy, "tip_content": tip["content"],
+                            "priority": tip["features"]["priority"],
+                            "is_overdue": tip["features"]["is_overdue"],
+                            "action": action, "dwell_ms": dwell_ms, "reward": reward,
+                            "hour": hour, "day_of_week": dow,
+                        })
+                for p in policies:
+                    prev = acc[p]["cumulative_rewards"][-1] if acc[p]["cumulative_rewards"] else 0.0
+                    acc[p]["cumulative_rewards"].append(prev + round_rewards[p])
+                if mlflow_ctx:
+                    for p in policies:
+                        mlflow.log_metric(f"{p}_cumulative_reward",
+                                          acc[p]["cumulative_rewards"][-1], step=rnd)
+                mode = "llm" if use_llm else "rule"
+                print(f"  Round {rnd+1:>3}/{n_rounds} [{mode}]  " + "  ".join(
+                    f"{p}={acc[p]['cumulative_rewards'][-1]:+.2f}" for p in policies
+                ))
+        result = _build_result(run_id, started_at, policies, acc, events,
+                               n_users, n_rounds, tasks_per_round, use_llm, seed)
+        result["mlflow_run_id"] = mlflow_run_id
+        if mlflow_ctx:
+            for p, s in result["summary"].items():
+                mlflow.log_metrics({
+                    f"{p}_total_reward": s["total_reward"],
+                    f"{p}_mean_reward": s["mean_reward"],
+                    f"{p}_n_pulls": s["n_pulls"],
+                })
+            mlflow.set_tag("winner", result["winner"])
+        return result
+    finally:
+        if mlflow_ctx:
+            mlflow_ctx.__exit__(None, None, None)
 # ── Claude Code judge — phase 1: score ─────────────────────────────────────
@@ -494,6 +556,9 @@ if __name__ == "__main__":
                        help="Alias for --judge rule (backwards compat)")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--out", default=None)
    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI"),
                        help="MLflow tracking URI (e.g. http://mlflow:5000/mlflow)")
    parser.add_argument("--mlflow-experiment", default="bandit_simulation")
    args = parser.parse_args()
    if args.no_llm:
@@ -534,6 +599,7 @@ if __name__ == "__main__":
            n_users=args.n_users, n_rounds=args.n_rounds,
            tasks_per_round=args.tasks_per_round, ml_url=args.ml_url,
            policies=args.policies, use_llm=use_llm, seed=args.seed,
            mlflow_url=args.mlflow_url, mlflow_experiment=args.mlflow_experiment,
        )
        Path(out_path).write_text(json.dumps(result, indent=2))
        print()
--- a/ml/features/__init__.py
+++ b/ml/features/__init__.py
@@ -1,3 +1,8 @@
-from .context import build_context, PromptContext, TaskSignal
+from .context import build_context, PromptContext, TaskSignal, ContextFeatureSpec, CONTEXT_FEATURES
 from .profile_schema import ProfileFeature, PROFILE_FEATURES, feature_names
-__all__ = ["build_context", "PromptContext", "TaskSignal"]
+__all__ = [
    "build_context", "PromptContext", "TaskSignal",
    "ContextFeatureSpec", "CONTEXT_FEATURES",
    "ProfileFeature", "PROFILE_FEATURES", "feature_names",
 ]
--- a/ml/features/context.py
+++ b/ml/features/context.py
@@ -2,12 +2,56 @@
 Context assembler — converts raw user signals into a PromptContext for LLM tip generation.
 Usage:
-    from ml.features.context import build_context
+    from ml.features.context import build_context, CONTEXT_FEATURES
    ctx = build_context(tasks, hour_of_day=9, day_of_week=2)
 Feature-spec (issue #61):
  All context features are JIT — they are assembled at request time from live
  sources (system clock, caller-supplied task list) rather than read from a
  cached profile store. They carry no TTL because they are never persisted.
 """
 from __future__ import annotations
 from dataclasses import dataclass, field
 from typing import Literal
@dataclass(frozen=True)
 class ContextFeatureSpec:
    name: str
    dtype: Literal["numeric", "categorical", "list"]
    freshness: Literal["jit", "batched"]
    source: str
    fallback: str
    description: str
 CONTEXT_FEATURES: tuple[ContextFeatureSpec, ...] = (
    ContextFeatureSpec(
        name="hour_of_day",
        dtype="numeric",
        freshness="jit",
        source="request",
        fallback="12",
        description="Current hour (0–23), supplied by the caller at score time.",
    ),
    ContextFeatureSpec(
        name="day_of_week",
        dtype="numeric",
        freshness="jit",
        source="request",
        fallback="0",
        description="Weekday index (0=Monday … 6=Sunday, matching Python's date.weekday()), supplied by the caller at score time.",
    ),
    ContextFeatureSpec(
        name="tasks",
        dtype="list",
        freshness="jit",
        source="todoist-integration",
        fallback="[]",
        description="User's open tasks fetched live from the Todoist integration at request time.",
    ),
 )
@dataclass
--- a/ml/features/profile_schema.py
+++ b/ml/features/profile_schema.py
@@ -8,6 +8,14 @@ code (ml/serving, eval harnesses, notebooks) knows what fields to expect on
 Update this file whenever you add or rename a feature in the TS registry.
 The accompanying test asserts the two stay in sync at the name level.
 Feature-spec fields (issue #61):
  freshness      — "batched": value cached in profile store, recomputed on TTL/event.
  ttl_sec        — cache lifetime in seconds; mirrors ``ttlSec`` in registry.ts.
  source         — where the value originates.
  fallback       — raw value returned when the feature is unavailable (null stored).
  invalidated_by — bus event subjects that trigger recompute for the affected user;
                   mirrors ``invalidatedBy`` in registry.ts. Empty = TTL-only refresh.
 """
 from __future__ import annotations
@@ -16,6 +24,10 @@ from typing import Literal
 Dtype = Literal["numeric", "categorical"]
 Freshness = Literal["jit", "batched"]
 _HOUR = 3600
 _DAY = 86_400
@dataclass(frozen=True)
@@ -23,28 +35,63 @@ class ProfileFeature:
    name: str
    dtype: Dtype
    description: str
    freshness: Freshness
    ttl_sec: int
    source: str
    fallback: str
    invalidated_by: tuple[str, ...] = ()
 PROFILE_FEATURES: tuple[ProfileFeature, ...] = (
    ProfileFeature(
-        "completion_rate_30d", "numeric",
-        'Fraction of tips served in the last 30 days that received a "done" reaction.',
+        name="completion_rate_30d",
+        dtype="numeric",
+        description='Fraction of tips served in the last 30 days that received a "done" reaction.',
+        freshness="batched",
+        ttl_sec=6 * _HOUR,
+        source="profile_store",
+        fallback="0.0",
+        invalidated_by=("signals.tip.feedback",),
    ),
    ProfileFeature(
-        "dismiss_rate_30d", "numeric",
-        'Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
+        name="dismiss_rate_30d",
+        dtype="numeric",
+        description='Fraction of tips served in the last 30 days that received a "dismiss" reaction.',
+        freshness="batched",
+        ttl_sec=6 * _HOUR,
+        source="profile_store",
+        fallback="0.0",
+        invalidated_by=("signals.tip.feedback",),
    ),
    ProfileFeature(
-        "mean_dwell_ms_30d", "numeric",
-        "Average dwell time (ms between served and reacted) over the last 30 days.",
+        name="mean_dwell_ms_30d",
+        dtype="numeric",
+        description="Average dwell time (ms between served and reacted) over the last 30 days.",
+        freshness="batched",
+        ttl_sec=6 * _HOUR,
+        source="profile_store",
+        fallback="null — serving normalises to 0.0",
+        invalidated_by=("signals.tip.feedback",),
    ),
    ProfileFeature(
-        "preferred_hour", "numeric",
-        'Hour-of-day with the most "done" reactions in the last 30 days (0-23).',
+        name="preferred_hour",
+        dtype="numeric",
+        description='Hour-of-day with the most "done" reactions in the last 30 days (0–23).',
+        freshness="batched",
+        ttl_sec=_DAY,
+        source="profile_store",
+        fallback="null — serving normalises to 0.5 (neutral alignment)",
+        invalidated_by=("signals.tip.feedback",),
    ),
    ProfileFeature(
-        "tip_volume_30d", "numeric",
-        "Number of tips served to the user in the last 30 days.",
+        name="tip_volume_30d",
+        dtype="numeric",
+        description="Number of tips served to the user in the last 30 days.",
+        freshness="batched",
+        ttl_sec=_HOUR,
+        source="profile_store",
+        fallback="0",
+        invalidated_by=("signals.tip.served",),
    ),
 )
--- a/ml/features/test_context.py
+++ b/ml/features/test_context.py
@@ -1,7 +1,7 @@
 """Tests for ml/features/context.py"""
 import pytest
 import sys, os; sys.path.insert(0, os.path.dirname(__file__))
-from context import build_context, TaskSignal, PromptContext
+from context import build_context, TaskSignal, PromptContext, CONTEXT_FEATURES
 def test_empty_tasks():
@@ -62,3 +62,30 @@ def test_due_date_none_preserved():
    tasks = [TaskSignal(id="x", content="No due", due_date=None)]
    ctx = build_context(tasks)
    assert ctx.tasks[0]["due_date"] is None
 # ── CONTEXT_FEATURES spec tests (issue #61) ──────────────────────────────────
 def test_context_features_expected_names():
    names = {f.name for f in CONTEXT_FEATURES}
    assert names == {"hour_of_day", "day_of_week", "tasks"}
 def test_context_features_all_jit():
    for f in CONTEXT_FEATURES:
        assert f.freshness == "jit", f"{f.name}: expected freshness='jit', got {f.freshness!r}"
 def test_context_features_source_set():
    for f in CONTEXT_FEATURES:
        assert f.source, f"{f.name}: source must not be empty"
 def test_context_features_fallback_set():
    for f in CONTEXT_FEATURES:
        assert f.fallback, f"{f.name}: fallback must not be empty"
 def test_context_features_no_duplicates():
    names = [f.name for f in CONTEXT_FEATURES]
    assert len(names) == len(set(names)), f"duplicate names: {names}"
--- a/ml/features/test_profile_schema.py
+++ b/ml/features/test_profile_schema.py
@@ -1,9 +1,11 @@
-"""Smoke test for profile_schema mirror (#81 phase A).
+"""Smoke test for profile_schema mirror (#81 phase A, #61 freshness spec).
 The TS registry in services/api/src/profile/registry.ts is the source of truth.
 This test checks the names listed here match the registry by reading the TS
 file and grepping for `name: '...'`. Crude but cheap, and it catches the
 common rename/add-without-mirror failure mode.
 Also verifies invalidated_by subjects mirror the TS invalidatedBy arrays (#61).
 """
 from __future__ import annotations
 import re
@@ -14,6 +16,18 @@ from ml.features.profile_schema import PROFILE_FEATURES, feature_names
 REGISTRY_PATH = Path(__file__).resolve().parents[2] / "services" / "api" / "src" / "profile" / "registry.ts"
 _HOUR = 3600
 _DAY = 86_400
 # Expected ttl_sec values mirrored from registry.ts — keeps the two in sync.
 _EXPECTED_TTL: dict[str, int] = {
    "completion_rate_30d": 6 * _HOUR,
    "dismiss_rate_30d":    6 * _HOUR,
    "mean_dwell_ms_30d":   6 * _HOUR,
    "preferred_hour":      _DAY,
    "tip_volume_30d":      _HOUR,
 }
 def _ts_registry_names() -> set[str]:
    text = REGISTRY_PATH.read_text(encoding="utf-8")
@@ -21,6 +35,35 @@ def _ts_registry_names() -> set[str]:
    return set(re.findall(r"name:\s*'([a-zA-Z0-9_]+)'", text))
 def _ts_registry_ttls() -> dict[str, int]:
    """Parse ttlSec values from registry.ts (crude but sufficient for drift detection).
    Handles TS symbolic constants (HOUR, DAY) and expressions like ``6 * HOUR``.
    """
    text = REGISTRY_PATH.read_text(encoding="utf-8")
    # Extract numeric constants: `const HOUR = 3600;` or `const DAY = 86_400;`
    consts: dict[str, int] = {}
    for m in re.finditer(r"const\s+([A-Z_]+)\s*=\s*([\d_]+)", text):
        consts[m.group(1)] = int(m.group(2).replace("_", ""))
    def _eval_expr(expr: str) -> int:
        tokens = [t.strip() for t in expr.split("*")]
        result = 1
        for t in tokens:
            result *= consts[t] if t in consts else int(t)
        return result
    result: dict[str, int] = {}
    for block in re.split(r"\{", text):
        name_m = re.search(r"name:\s*'([a-zA-Z0-9_]+)'", block)
        # ttlSec may be a constant name, a number, or `N * CONST`
        ttl_m = re.search(r"ttlSec:\s*([A-Za-z0-9_]+(?:\s*\*\s*[A-Za-z0-9_]+)?)", block)
        if name_m and ttl_m:
            result[name_m.group(1)] = _eval_expr(ttl_m.group(1))
    return result
 def test_python_mirror_matches_ts_registry():
    py_names = feature_names()
    ts_names = _ts_registry_names()
@@ -39,3 +82,68 @@ def test_profile_schema_no_duplicates():
 def test_profile_schema_dtypes_known():
    for f in PROFILE_FEATURES:
        assert f.dtype in {"numeric", "categorical"}
 def test_all_profile_features_are_batched():
    for f in PROFILE_FEATURES:
        assert f.freshness == "batched", f"{f.name}: expected freshness='batched', got {f.freshness!r}"
 def test_profile_feature_ttl_matches_ts_registry():
    ts_ttls = _ts_registry_ttls()
    for f in PROFILE_FEATURES:
        assert f.name in ts_ttls, f"{f.name} not found in TS registry ttlSec parse"
        assert f.ttl_sec == ts_ttls[f.name], (
            f"{f.name}: Python ttl_sec={f.ttl_sec} != TS ttlSec={ts_ttls[f.name]}"
        )
 def test_profile_feature_ttl_matches_expected():
    for f in PROFILE_FEATURES:
        assert f.ttl_sec == _EXPECTED_TTL[f.name], (
            f"{f.name}: ttl_sec={f.ttl_sec}, expected {_EXPECTED_TTL[f.name]}"
        )
 def test_profile_feature_source_is_profile_store():
    for f in PROFILE_FEATURES:
        assert f.source == "profile_store", f"{f.name}: unexpected source {f.source!r}"
 def test_profile_feature_fallback_set():
    for f in PROFILE_FEATURES:
        assert f.fallback, f"{f.name}: fallback must not be empty"
 def _ts_registry_invalidated_by() -> dict[str, list[str]]:
    """Parse invalidatedBy arrays from registry.ts.
    Extracts subjects from blocks like:
        invalidatedBy: ['signals.tip.feedback'],
    Returns {feature_name: [subject, ...]}; features with no invalidatedBy get [].
    """
    text = REGISTRY_PATH.read_text(encoding="utf-8")
    result: dict[str, list[str]] = {}
    for block in re.split(r"\{", text):
        name_m = re.search(r"name:\s*'([a-zA-Z0-9_]+)'", block)
        if not name_m:
            continue
        name = name_m.group(1)
        inv_m = re.search(r"invalidatedBy:\s*\[([^\]]*)\]", block)
        if inv_m:
            subjects = re.findall(r"'([^']+)'", inv_m.group(1))
        else:
            subjects = []
        result[name] = subjects
    return result
 def test_invalidated_by_matches_ts_registry():
    ts_inv = _ts_registry_invalidated_by()
    for f in PROFILE_FEATURES:
        assert f.name in ts_inv, f"{f.name} not found in TS registry invalidatedBy parse"
        expected = tuple(sorted(ts_inv[f.name]))
        actual = tuple(sorted(f.invalidated_by))
        assert actual == expected, (
            f"{f.name}: Python invalidated_by={actual} != TS invalidatedBy={expected}"
        )
--- a/ml/serving/README.md
+++ b/ml/serving/README.md
@@ -0,0 +1,104 @@
 # ml/serving
 FastAPI online scorer, tip generator, and JetStream consumer.
 ## Contract
 | Endpoint | Description |
 |----------|-------------|
 | `POST /score` | LinUCB d=5 (baseline, shadow-eligible) |
 | `POST /score/egreedy` | ε-greedy v1, d=7 (active policy — ADR-0007) |
 | `POST /score/egreedy/v2` | ε-greedy v2, d=12 + profile features (shadow — ADR-0012) |
 | `POST /reward` / `/reward/egreedy` / `/reward/egreedy/v2` | Online reward update per policy |
 | `POST /generate` | LLM tip candidates via LiteLLM `tip-generator` alias |
 | `GET /stats/{user_id}` / `/stats/egreedy/{user_id}` / `/stats/egreedy/v2/{user_id}` | Per-user policy stats |
 | `GET /features/{user_id}` | Last 100 scored feature vectors (ring buffer) |
 | `POST /reset/{user_id}` | Clear all per-user bandit state (admin) |
 | `GET /health` | `{ ok, nats: { enabled, consumers: { signals, feedback } } }` |
 Called by `services/api/src/recommender/` over HTTP. Contract is stable across policy swaps.
 ## Feature dimensions
 | Policy | d | Extra dims vs previous |
 |--------|---|------------------------|
 | LinUCB v1 | 5 | hour_sin/cos, is_overdue, task_age, priority |
 | ε-greedy v1 | 7 | + dow_sin/cos |
 | ε-greedy v2 | 12 | + 5 profile features (ADR-0012) |
 Profile features are computed by the TypeScript API and shipped on each `/score` call as `profile_features`. See `ml/README.md` and ADR-0011.
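
The table above can be read as a feature-vector recipe. The following is a minimal sketch of how the d=5 and d=7 vectors might be assembled; the function names, the 30-day age cap, and the 1..4 priority scale are assumptions for illustration and are not taken from the serving code.

```python
import math


def cyclic(value: float, period: float) -> tuple[float, float]:
    """Encode a periodic value as (sin, cos) so both ends of the cycle are adjacent."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def linucb_v1_features(hour_of_day: int, is_overdue: bool,
                       task_age_days: float, priority: int) -> list[float]:
    """d=5: hour_sin, hour_cos, is_overdue, task_age, priority."""
    hour_sin, hour_cos = cyclic(hour_of_day, 24)
    return [
        hour_sin,
        hour_cos,
        1.0 if is_overdue else 0.0,
        min(task_age_days, 30.0) / 30.0,  # assumed cap and scale
        priority / 4.0,                   # assumed 1..4 priority range
    ]


def egreedy_v1_features(hour_of_day: int, day_of_week: int, is_overdue: bool,
                        task_age_days: float, priority: int) -> list[float]:
    """d=7: the five dims above plus dow_sin and dow_cos."""
    dow_sin, dow_cos = cyclic(day_of_week, 7)
    base = linucb_v1_features(hour_of_day, is_overdue, task_age_days, priority)
    return base + [dow_sin, dow_cos]
```

The sin/cos pair keeps hour 23 close to hour 0 in feature space, which a raw integer hour would not.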
 ## JetStream consumers
 On startup, `nats_consumer.py` registers two durable push consumers against NATS JetStream:
 | Consumer | Stream | Subjects | Durable name |
 |----------|--------|----------|--------------|
 | signals | `signals` | `signals.>` | `feature-pipeline-signals` |
 | feedback | `feedback` | `feedback.>` | `feature-pipeline-feedback` |
 **Handled subjects:**
 - `signals.task.synced` — writes `{last_sync_ts, task_count}` to `{STATE_DIR}/{user}_sync.json`
 - `signals.tip.feedback` — logged for observability; reward update happens via the HTTP path in the recommender
 **Payload validation:** each message is validated against the pydantic models in `schemas.py` (mirroring `packages/shared-types/events/oo/events/v1/`). A `ValidationError` triggers a nak so the message is redelivered rather than silently dropped.
 **Ack semantics:** explicit ack on success; nak for redelivery on error; dead-lettered after `NATS_MAX_DELIVER` attempts.
 **Disabled** when `NATS_URL` is unset (default in local dev without NATS). No import of `nats-py` occurs in that case.
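
The validate-then-ack flow described above can be sketched without a broker. This is an illustrative, synchronous stand-in: `FakeMsg`, `PayloadError`, `validate_task_synced`, and the `user_id` field are hypothetical names, a plain type check replaces the pydantic models, and in nats-py the real `ack()` and `nak()` are awaitable.

```python
import json
from dataclasses import dataclass, field


class PayloadError(ValueError):
    """Stand-in for pydantic's ValidationError in this sketch."""


@dataclass
class FakeMsg:
    """Hypothetical stand-in for a JetStream message; records ack/nak calls."""
    subject: str
    data: bytes
    calls: list = field(default_factory=list)

    def ack(self) -> None:
        self.calls.append("ack")

    def nak(self) -> None:
        self.calls.append("nak")


def validate_task_synced(raw: bytes) -> dict:
    """Parse and type-check a payload; raise PayloadError on any mismatch."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise PayloadError(str(exc)) from exc
    if not isinstance(payload, dict):
        raise PayloadError("payload must be a JSON object")
    for key, typ in (("user_id", str), ("task_count", int)):
        if not isinstance(payload.get(key), typ):
            raise PayloadError(f"missing or mistyped field: {key}")
    return payload


def handle(msg: FakeMsg, sink: dict) -> None:
    """Ack only after the payload validated and the side effect succeeded."""
    try:
        payload = validate_task_synced(msg.data)
        sink[payload["user_id"]] = payload["task_count"]
    except Exception:
        msg.nak()  # ask the server to redeliver instead of dropping silently
        return
    msg.ack()
```

Acking last means a crash between validation and the write leaves the message unacknowledged, so the server redelivers it.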
 ## Observability
 Logs are structured JSON via **structlog**. Every line includes `level`, `logger`, `timestamp`, and — when a W3C `traceparent` header is present on the incoming request — `trace_id` bound via Python `contextvars`, so all log lines within a request carry the same trace ID as the upstream API call.
 Sentry error capture is active when `SENTRY_DSN` is set.
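
As a rough illustration of the trace binding described above, the sketch below extracts the trace ID from a W3C `traceparent` header and stores it in a context variable. The helper names are hypothetical; since the structlog configuration in this change merges contextvars, the service presumably binds the value through `structlog.contextvars.bind_contextvars` rather than a bare `ContextVar`.

```python
import contextvars
import re
from typing import Optional

# traceparent = version "-" trace-id "-" parent-id "-" trace-flags (lowercase hex)
_TRACEPARENT = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$")

# Hypothetical per-request holder for the current trace ID.
trace_id_var: contextvars.ContextVar = contextvars.ContextVar("trace_id", default=None)


def extract_trace_id(header: Optional[str]) -> Optional[str]:
    """Return the 32-hex trace ID, or None when the header is absent or malformed."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id = match.group(1)
    if trace_id == "0" * 32:  # an all-zero trace ID is invalid
        return None
    return trace_id


def bind_trace_id(header: Optional[str]) -> Optional[str]:
    """Store the trace ID for the current context and return it."""
    trace_id = extract_trace_id(header)
    trace_id_var.set(trace_id)
    return trace_id
```

Because context variables are scoped per task or thread, concurrent requests do not overwrite each other's trace ID.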
 ## Config
 | Env var | Default | Description |
 |---------|---------|-------------|
 | `STATE_DIR` | `/tmp/oo-bandit-state` | Directory for per-user bandit state JSON files |
 | `LITELLM_URL` | `http://localhost:4000` | LiteLLM gateway |
 | `LITELLM_MASTER_KEY` | `sk-oo-dev` | LiteLLM auth key |
 | `NATS_URL` | *(empty)* | NATS broker URL; empty = consumers disabled |
 | `NATS_DURABLE_PREFIX` | `feature-pipeline` | Prefix for durable consumer names |
 | `NATS_MAX_DELIVER` | `5` | Max redelivery attempts before dropping |
 | `DEFAULT_PROMPT_VERSION` | `v1` | Fallback prompt version for `/generate` |
 | `ENV` | `development` | Environment label (passed to Sentry) |
 | `SENTRY_DSN` | *(empty)* | Sentry DSN; empty = Sentry disabled |
 ## Health story
 `GET /health` returns `{ ok: true }` plus NATS consumer state:
 ```json
 {
  "ok": true,
  "nats": {
    "enabled": true,
    "consumers": {
      "signals": { "last_msg_ts": "2026-04-25T10:00:00Z", "processed": 42, "errors": 0 },
      "feedback": { "last_msg_ts": null, "processed": 0, "errors": 0 }
    }
  }
 }
 ```
 `last_msg_ts` is `null` until the first message arrives. The docker-compose healthcheck uses this endpoint.
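
A consumer of this payload could interpret it roughly as follows. Both helpers are hypothetical and only illustrate the shape shown above; the actual docker-compose healthcheck command is not part of this section.

```python
def is_healthy(payload: dict, require_nats: bool = False) -> bool:
    """Return True when the /health payload indicates a usable service."""
    if payload.get("ok") is not True:
        return False
    nats = payload.get("nats") or {}
    # NATS is optional in local dev, so only insist on it when asked to.
    if require_nats and not nats.get("enabled", False):
        return False
    return True


def total_consumer_errors(payload: dict) -> int:
    """Sum the per-consumer error counters; 0 when NATS is disabled."""
    consumers = (payload.get("nats") or {}).get("consumers") or {}
    return sum(int(c.get("errors", 0)) for c in consumers.values())
```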
 ## Extraction criteria
 This service already runs as its own process. Move it to a dedicated host or GPU node when:
 - p99 scoring latency exceeds 50 ms under load, **or**
 - model weights are too large to share memory with the Python process on the current host.
 ## State
 Per-user bandit state is stored as JSON files in `STATE_DIR`:
 | File pattern | Policy |
 |---|---|
 | `{user}.json` | LinUCB v1 |
 | `{user}_egreedy.json` | ε-greedy v1 |
 | `{user}_egreedy_v2.json` | ε-greedy v2 |
 | `{user}_sync.json` | Last task sync metadata (written by JetStream consumer) |
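
The file patterns above map to paths as in the sketch below. The `state_path` helper and the `egreedy-v2` and `sync` keys are assumptions for illustration; `linucb-v1` and `egreedy-v1` follow the policy names used by the simulation runner.

```python
from pathlib import Path

# Suffix appended to the user ID for each kind of state file.
_SUFFIX = {
    "linucb-v1": "",
    "egreedy-v1": "_egreedy",
    "egreedy-v2": "_egreedy_v2",  # assumed key name
    "sync": "_sync",              # assumed key name
}


def state_path(state_dir: str, user_id: str, kind: str) -> Path:
    """Build the JSON state file path for one user and one state kind."""
    return Path(state_dir) / f"{user_id}{_SUFFIX[kind]}.json"
```

For example, `state_path("/tmp/oo-bandit-state", "u1", "egreedy-v2")` points at `u1_egreedy_v2.json` inside the default `STATE_DIR`.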
--- a/ml/serving/logging_config.py
+++ b/ml/serving/logging_config.py
@@ -0,0 +1,19 @@
 """Structlog JSON configuration — import once at process start."""
 import logging
 import structlog
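 def configure() -> None:
    structlog.configure(
        processors=[
            # Merge values bound via structlog.contextvars (e.g. trace_id) into each event.
            structlog.contextvars.merge_contextvars,
            # Add the "level" key from the logging method that was called.
            structlog.stdlib.add_log_level,
            # Add an ISO 8601 "timestamp" key.
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            # Must stay last: serialises the event dict to a JSON string.
            structlog.processors.JSONRenderer(),
        ],
        # Drop events below INFO before any processor runs.
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(),
    )
    # Keep third-party stdlib loggers quiet unless they warn.
    logging.basicConfig(level=logging.WARNING)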
--- a/ml/serving/main.py
+++ b/ml/serving/main.py
--- a/ml/serving/mlflow_client.py
+++ b/ml/serving/mlflow_client.py
@@ -0,0 +1,201 @@
 """Thin MLflow REST wrapper.
 Why not the official ``mlflow`` SDK? Two reasons specific to the oO setup:
 1. The MLflow server (3.11) ships with ``--allowed-hosts localhost`` but
   curl / requests / urllib3 send ``Host: localhost:5000`` — the port
   suffix fails the DNS-rebinding check. We override the Host header per
   request, which the SDK doesn't expose.
 2. The collect/judge phases only need ~6 endpoints (create/search/log).
   Pulling a 200MB SDK transitively for that is excess weight.
 All calls are synchronous httpx with explicit ``Host`` so the script can
 run from the host shell or from inside docker without further config.
 """
 from __future__ import annotations
 import os
 import time
 from dataclasses import dataclass
 from typing import Any
 import httpx
 def _strip_path(uri: str) -> tuple[str, str]:
    """Return (origin, path_prefix) — handles both /mlflow and / roots.
    ``http://mlflow:5000/mlflow``  → ("http://mlflow:5000", "/mlflow")
    ``http://localhost:5000``      → ("http://localhost:5000", "")
    """
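    uri = uri.rstrip("/")
    scheme, sep, rest = uri.partition("://")
    if not sep:
        # Fail with a clear message instead of an IndexError on scheme-less input.
        raise ValueError(f"tracking URI must include a scheme: {uri!r}")
    host, _, path = rest.partition("/")
    # An empty ``path`` means the URI has no prefix (e.g. http://localhost:5000).
    return f"{scheme}://{host}", (f"/{path}" if path else "")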
@dataclass
 class MLflowClient:
    tracking_uri: str
    username: str | None = None
    password: str | None = None
    host_header: str | None = None  # override for DNS-rebinding sidestep
    timeout: float = 30.0
    def __post_init__(self) -> None:
        self._origin, self._ui_prefix = _strip_path(self.tracking_uri)
        # MLflow 3.x exposes the REST API at the root, *not* under the
        # ``/mlflow`` UI prefix. Empirically verified against the running
        # ghcr.io/mlflow/mlflow:v3.11.1 container.
        self._api = f"{self._origin}/api/2.0/mlflow"
        self._auth = (self.username, self.password) if self.username else None
        # If user did not pass a host header, derive from origin. Strip
        # the port if present — the server's allowed-hosts check rejects
        # ``localhost:5000`` even when ``localhost`` is allowed.
        if self.host_header is None:
            host = self._origin.split("://", 1)[1]
            self.host_header = host.split(":", 1)[0]
    @classmethod
    def from_env(cls) -> "MLflowClient":
        return cls(
            tracking_uri=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
            username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
            password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
            host_header=os.environ.get("MLFLOW_HOST_HEADER"),
        )
    def _headers(self) -> dict[str, str]:
        return {"Host": self.host_header or "localhost"}
    def _post(self, path: str, body: dict) -> dict:
        with httpx.Client(trust_env=False, timeout=self.timeout) as c:
            r = c.post(f"{self._api}{path}", json=body, headers=self._headers(), auth=self._auth)
            r.raise_for_status()
            return r.json()
    def _get(self, path: str, params: dict | None = None) -> dict:
        with httpx.Client(trust_env=False, timeout=self.timeout) as c:
            r = c.get(f"{self._api}{path}", params=params or {}, headers=self._headers(), auth=self._auth)
            r.raise_for_status()
            return r.json()
    # ── Experiments ────────────────────────────────────────────────────
    def get_or_create_experiment(self, name: str) -> str:
        try:
            r = self._get("/experiments/get-by-name", {"experiment_name": name})
            return r["experiment"]["experiment_id"]
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in (404, 400):
                raise
        r = self._post("/experiments/create", {"name": name})
        return r["experiment_id"]
    # ── Runs ───────────────────────────────────────────────────────────
    def create_run(
        self,
        experiment_id: str,
        run_name: str,
        tags: dict[str, str] | None = None,
    ) -> str:
        body: dict[str, Any] = {
            "experiment_id": experiment_id,
            "start_time": int(time.time() * 1000),
            "run_name": run_name,
            "tags": [
                {"key": k, "value": str(v)}
                for k, v in (tags or {}).items()
            ],
        }
        r = self._post("/runs/create", body)
        return r["run"]["info"]["run_id"]
    def log_param(self, run_id: str, key: str, value: Any) -> None:
        self._post("/runs/log-parameter", {"run_id": run_id, "key": key, "value": str(value)})
    def log_params(self, run_id: str, params: dict[str, Any]) -> None:
        for k, v in params.items():
            self.log_param(run_id, k, v)
    def log_metric(self, run_id: str, key: str, value: float, step: int = 0) -> None:
        self._post("/runs/log-metric", {
            "run_id": run_id,
            "key": key,
            "value": float(value),
            "timestamp": int(time.time() * 1000),
            "step": step,
        })
    def log_metrics(self, run_id: str, metrics: dict[str, float]) -> None:
        for k, v in metrics.items():
            self.log_metric(run_id, k, v)
    def set_tag(self, run_id: str, key: str, value: str) -> None:
        self._post("/runs/set-tag", {"run_id": run_id, "key": key, "value": str(value)})
    def set_tags(self, run_id: str, tags: dict[str, str]) -> None:
        for k, v in tags.items():
            self.set_tag(run_id, k, v)
    # MLflow tag values are capped at 5000 chars by the server; longer values
    # are rejected with INVALID_PARAMETER_VALUE. 4500 leaves headroom for
    # internal metadata MLflow may append on its own.
    _TAG_VALUE_LIMIT = 4500
    def log_text(self, run_id: str, text: str, artifact_path: str) -> None:
        """Persist short text alongside the run.
        The MLflow server in this deployment uses a ``file://`` artifact
        backend, which is only reachable from inside the container — not
        via the REST proxy. We instead stash short payloads as tags
        keyed ``artifact:<path>``. Anything longer than 4500 chars is
        chunked into ``artifact:<path>:0``, ``:1`` …; ``get_artifact_text``
        re-stitches them in order.
        """
        key_base = f"artifact:{artifact_path}"
        if len(text) <= self._TAG_VALUE_LIMIT:
            self.set_tag(run_id, key_base, text)
            return
        # chunk
        for i in range(0, len(text), self._TAG_VALUE_LIMIT):
            self.set_tag(run_id, f"{key_base}:{i // self._TAG_VALUE_LIMIT}",
                          text[i:i + self._TAG_VALUE_LIMIT])
    def get_artifact_text(self, run_id: str, artifact_path: str) -> str:
        run = self._get("/runs/get", {"run_id": run_id})["run"]
        tags = {t["key"]: t["value"] for t in run["data"].get("tags", [])}
        key_base = f"artifact:{artifact_path}"
        if key_base in tags:
            return tags[key_base]
        # chunked form; skip keys whose suffix is not a numeric chunk index
        prefix = f"{key_base}:"
        chunks = sorted(
            (k for k in tags if k.startswith(prefix) and k[len(prefix):].isdigit()),
            key=lambda k: int(k[len(prefix):]),
        )
        return "".join(tags[k] for k in chunks)
    def end_run(self, run_id: str, status: str = "FINISHED") -> None:
        self._post("/runs/update", {
            "run_id": run_id,
            "status": status,
            "end_time": int(time.time() * 1000),
        })
    def search_runs(
        self,
        experiment_id: str,
        filter_string: str = "",
        max_results: int = 1000,
    ) -> list[dict]:
        body = {
            "experiment_ids": [experiment_id],
            "filter": filter_string,
            "max_results": max_results,
        }
        r = self._post("/runs/search", body)
        return r.get("runs", [])
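Review note: the tag-chunking scheme behind `log_text` / `get_artifact_text` can be exercised without a server. The sketch below is illustrative and not part of this commit. It assumes a plain dict (`store`, a hypothetical stand-in for the run's tags) in place of the REST calls, keeps the same 4500-character chunk size, and skips keys whose suffix is not numeric.

```python
TAG_VALUE_LIMIT = 4500  # same headroom value as _TAG_VALUE_LIMIT in the client


def put_text(store: dict[str, str], artifact_path: str, text: str) -> None:
    """Store text under artifact:<path>; chunk into :0, :1, ... when too long."""
    key_base = f"artifact:{artifact_path}"
    if len(text) <= TAG_VALUE_LIMIT:
        store[key_base] = text
        return
    for i in range(0, len(text), TAG_VALUE_LIMIT):
        store[f"{key_base}:{i // TAG_VALUE_LIMIT}"] = text[i:i + TAG_VALUE_LIMIT]


def get_text(store: dict[str, str], artifact_path: str) -> str:
    """Return the unchunked value if present, else re-stitch chunks in numeric order."""
    key_base = f"artifact:{artifact_path}"
    if key_base in store:
        return store[key_base]
    prefix = f"{key_base}:"
    chunks = sorted(
        (k for k in store if k.startswith(prefix) and k[len(prefix):].isdigit()),
        key=lambda k: int(k[len(prefix):]),  # numeric sort, so :10 follows :9
    )
    return "".join(store[k] for k in chunks)


store: dict[str, str] = {}
long_text = "x" * 4500 + "y" * 4500 + "z" * 10  # 9010 chars, so three chunks
put_text(store, "judge/report.txt", long_text)
put_text(store, "note", "short")
```

Sorting on the integer suffix rather than the raw key matters once a payload needs more than ten chunks, because `":10"` sorts before `":2"` as a string.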
--- a/ml/serving/nats_consumer.py
+++ b/ml/serving/nats_consumer.py
@@ -0,0 +1,146 @@
 """
 JetStream durable consumers for ml/serving.
 Streams:
  signals  (subjects: signals.>) — durable: {prefix}-signals
  feedback (subjects: feedback.>) — durable: {prefix}-feedback
 Handled subjects:
  signals.task.synced   → write per-user sync metadata to STATE_DIR
  signals.tip.feedback  → log for observability (reward is applied via HTTP path)
 Config (env vars):
  NATS_URL            — broker URL; empty = consumers disabled (default: "")
  NATS_DURABLE_PREFIX — prefix for durable consumer names (default: "feature-pipeline")
  NATS_MAX_DELIVER    — max redelivery attempts before dropping (default: 5)
 """
 from __future__ import annotations
 import json
 import os
 import time
 from pathlib import Path
 from typing import Optional
 import structlog
 from schemas import TaskSyncedPayload, TipFeedbackPayload
 log = structlog.get_logger(__name__)
 NATS_URL = os.getenv("NATS_URL", "")
 NATS_DURABLE_PREFIX = os.getenv("NATS_DURABLE_PREFIX", "feature-pipeline")
 NATS_MAX_DELIVER = int(os.getenv("NATS_MAX_DELIVER", "5"))
 # Exposed to /health
 consumer_health: dict[str, dict] = {
    "signals": {"last_msg_ts": None, "processed": 0, "errors": 0},
    "feedback": {"last_msg_ts": None, "processed": 0, "errors": 0},
 }
 _nc = None          # nats.aio.Client
 _subs: list = []    # active JetStream subscriptions
 # ── Subject handlers ───────────────────────────────────────────────────────
 def _sync_meta_path(state_dir: Path, user_id: str) -> Path:
    safe = "".join(c if c.isalnum() else "_" for c in user_id)
    return state_dir / f"{safe}_sync.json"
 async def _handle(subject: str, payload: dict, state_dir: Path) -> None:
    if subject == "signals.task.synced":
        msg = TaskSyncedPayload.model_validate(payload)
        p = _sync_meta_path(state_dir, msg.userId)
        p.write_text(json.dumps({
            "last_sync_ts": msg.syncedAt,
            "task_count": msg.count,
        }))
        log.info("nats: task_synced", user_id=msg.userId, count=msg.count)
    elif subject == "signals.tip.feedback":
        msg = TipFeedbackPayload.model_validate(payload)
        log.info("nats: tip_feedback", user_id=msg.userId, tip_id=msg.tipId, action=msg.action, reward=msg.reward)
    else:
        log.debug("nats: unhandled subject", subject=subject)
 # ── Consumer factory ───────────────────────────────────────────────────────
 def _make_handler(key: str, state_dir: Path):
    """Return an async push-consumer callback that acks on success, naks on error."""
    async def handler(msg) -> None:
        consumer_health[key]["last_msg_ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        try:
            payload = json.loads(msg.data)
            await _handle(msg.subject, payload, state_dir)
            await msg.ack()
            consumer_health[key]["processed"] += 1
        except Exception as exc:
            consumer_health[key]["errors"] += 1
            log.warning("nats: processing error", key=key, subject=msg.subject, exc=str(exc))
            await msg.nak()
    return handler
 # ── Lifecycle ──────────────────────────────────────────────────────────────
 async def start(state_dir: Path) -> None:
    """Connect to NATS and register durable push consumers. No-op if NATS_URL is unset."""
    global _nc
    if not NATS_URL:
        log.info("nats: NATS_URL unset — JetStream consumers disabled")
        return
    try:
        import nats as nats_lib
        from nats.js.api import ConsumerConfig, AckPolicy
        _nc = await nats_lib.connect(
            NATS_URL,
            name="ml-serving",
            reconnect_time_wait=5,
            max_reconnect_attempts=-1,
        )
        js = _nc.jetstream()
        log.info("nats: connected", url=NATS_URL)
    except Exception as exc:
        log.warning("nats: connection failed — consumers disabled", exc=str(exc))
        _nc = None
        return
    config = ConsumerConfig(
        ack_policy=AckPolicy.EXPLICIT,
        max_deliver=NATS_MAX_DELIVER,
    )
    for key, subject in [("signals", "signals.>"), ("feedback", "feedback.>")]:
        durable = f"{NATS_DURABLE_PREFIX}-{key}"
        try:
            sub = await js.subscribe(
                subject,
                durable=durable,
                cb=_make_handler(key, state_dir),
                config=config,
            )
            _subs.append(sub)
            log.info("nats: subscribed", subject=subject, durable=durable)
        except Exception as exc:
            log.warning("nats: subscribe failed", key=key, exc=str(exc))
 async def stop() -> None:
    """Unsubscribe all consumers, then drain and close the NATS connection."""
    global _nc
    for sub in _subs:
        try:
            await sub.unsubscribe()
        except Exception:
            pass
    _subs.clear()
    if _nc:
        try:
            await _nc.drain()
        except Exception:
            pass
        _nc = None
        log.info("nats: disconnected")
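Review note: the ack-on-success / nak-on-error contract of `_make_handler` can be checked without a broker. This is a minimal sketch under stated assumptions, not the module's code: `FakeMsg` is a hypothetical stand-in for a JetStream message, and `process` stands in for `_handle` with a deliberately trivial validation rule.

```python
import asyncio
import json


class FakeMsg:
    """Hypothetical stand-in for a JetStream message; records ack/nak calls."""

    def __init__(self, subject: str, data: bytes) -> None:
        self.subject = subject
        self.data = data
        self.acked = False
        self.naked = False

    async def ack(self) -> None:
        self.acked = True

    async def nak(self) -> None:
        self.naked = True


def make_handler(health: dict, process):
    """Wrap `process` so the message is acked on success and nak'd on any error."""
    async def handler(msg) -> None:
        try:
            await process(msg.subject, json.loads(msg.data))
            await msg.ack()
            health["processed"] += 1
        except Exception:
            health["errors"] += 1
            await msg.nak()  # ask the server to redeliver
    return handler


async def process(subject: str, payload: dict) -> None:
    # Illustrative validation only; the real handler validates pydantic models.
    if "userId" not in payload:
        raise ValueError("missing userId")


async def demo():
    health = {"processed": 0, "errors": 0}
    handler = make_handler(health, process)
    good = FakeMsg("signals.task.synced", b'{"userId": "u1"}')
    bad = FakeMsg("signals.task.synced", b"not json")
    await handler(good)
    await handler(bad)
    return health, good, bad


health, good, bad = asyncio.run(demo())
```

Because a nak requests redelivery, a payload that can never validate is retried until the consumer's `max_deliver` limit is reached.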
--- a/ml/serving/prompts.py
+++ b/ml/serving/prompts.py
@@ -23,6 +23,7 @@ class _Ctx(Protocol):
    hour_of_day: int
    day_of_week: int
    extra: dict
    profile_features: "dict | None"
@dataclass(frozen=True)
@@ -33,13 +34,29 @@ class Prompt:
 def _base_user_lines(ctx: "_Ctx") -> list[str]:
    # Overdue tasks first, then high-priority, then oldest — most actionable context at top
    tasks = sorted(
        ctx.tasks,
        key=lambda t: (not t.get("is_overdue", False), -t.get("priority", 1), -t.get("task_age_days", 0.0)),
    )
    lines = [f"Time: {ctx.hour_of_day:02d}:00, day_of_week={ctx.day_of_week}"]
-    if ctx.tasks:
+    if tasks:
-        overdue = [t for t in ctx.tasks if t.get("is_overdue")]
+        overdue = [t for t in tasks if t.get("is_overdue")]
-        lines.append(f"Tasks: {len(ctx.tasks)} total, {len(overdue)} overdue")
+        lines.append(f"Tasks: {len(tasks)} total, {len(overdue)} overdue")
-        for t in ctx.tasks[:5]:
+        for t in tasks[:5]:
            due = t.get("due_date", "no due date")
            lines.append(f"  - [{t.get('priority','?')}] {t.get('content','?')} (due: {due})")
    p = getattr(ctx, "profile_features", None) or {}
    if p:
        parts: list[str] = []
        if (v := p.get("completion_rate_30d")) is not None:
            parts.append(f"completion_rate={float(v):.0%}")
        if (v := p.get("dismiss_rate_30d")) is not None:
            parts.append(f"dismiss_rate={float(v):.0%}")
        if (v := p.get("preferred_hour")) is not None:
            parts.append(f"preferred_hour={int(v):02d}:00")
        if parts:
            lines.append(f"User profile: {', '.join(parts)}")
    for k, v in ctx.extra.items():
        lines.append(f"{k}: {v}")
    return lines
@@ -91,6 +108,93 @@ PROMPTS: dict[str, Prompt] = {
 }
 # ── v4-orchestrator ────────────────────────────────────────────────────────
 # Not a Prompt entry — takes pre-computed agent snippets, not a _Ctx.
 _SYS_V4_ORCHESTRATOR = (
    "You are a personal advisor generating a single, perfectly-timed tip. "
    "Multiple specialized agents have analyzed the user's current context and provided "
    "their insights below. Synthesize their combined perspective to generate exactly ONE "
    "tip that is specific, actionable, and relevant right now. "
    "Always respond in English regardless of the language of task content. "
    "Respond ONLY with a JSON object with keys: "
    '"id" (short slug), "content" (the tip, ≤2 sentences), '
    '"rationale" (why now, ≤1 sentence). '
    "No markdown, no prose outside the JSON object."
 )
 def _science_destiny_instruction(science_destiny: int) -> str:
    """Translate 0-100 slider into a prompt instruction.
    0   = pure science: prioritise patterns, data, measurable progress.
    100 = pure destiny: prioritise meaning, intuition, deeper purpose.
    50  = balanced; any value in 41-59 injects no extra instruction.
    """
    if science_destiny <= 20:
        return (
            "The user strongly prefers data-driven advice. "
            "Ground every tip in observable patterns, streaks, or measurable progress. "
            "Avoid abstract or motivational language."
        )
    if science_destiny <= 40:
        return (
            "The user leans toward evidence-based guidance. "
            "Anchor tips in patterns and metrics where possible."
        )
    if science_destiny >= 80:
        return (
            "The user strongly believes in intuition and meaning. "
            "Frame tips around purpose, values, and deeper intention rather than metrics."
        )
    if science_destiny >= 60:
        return (
            "The user leans toward intuitive, meaning-driven advice. "
            "Weave in purpose and intention alongside practicality."
        )
    return ""  # balanced — no extra instruction
 def build_orchestrator_messages(
    agent_outputs: list[dict],
    tasks: list[dict],
    hour_of_day: int,
    day_of_week: int,
    science_destiny: int = 50,
    recent_tip: str | None = None,
 ) -> list[dict]:
    """Build the [system, user] message list for the orchestrator LLM call.
    agent_outputs: list of {agent_id, prompt_text} dicts.
    Falls back to raw task summary when agent_outputs is empty.
    recent_tip: content of a tip the user just snoozed — generate something different.
    """
    style_hint = _science_destiny_instruction(science_destiny)
    system = _SYS_V4_ORCHESTRATOR + (f"\n\n{style_hint}" if style_hint else "")
    lines = [f"Current time: {hour_of_day:02d}:00, day_of_week={day_of_week}", ""]
    if recent_tip:
        lines.append(f"The user snoozed this tip (do NOT repeat it or anything similar): \"{recent_tip}\"")
        lines.append("")
    if agent_outputs:
        lines.append("Context from analysis agents:")
        for s in agent_outputs:
            lines.append(f"[{s['agent_id']}] {s['prompt_text']}")
    else:
        overdue = [t for t in tasks if t.get("is_overdue")]
        lines.append(
            f"No pre-computed agent context available. "
            f"Tasks: {len(tasks)} total, {len(overdue)} overdue."
        )
        for t in tasks[:3]:
            lines.append(f"  - {t.get('content', '?')}")
    lines.append("\nGenerate one tip as a JSON object. Write the tip content in English only.")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": "\n".join(lines)},
    ]
 def default_version() -> str:
    return os.getenv("DEFAULT_PROMPT_VERSION", "v1")
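Review note: the ordering used in `_base_user_lines` relies on tuple comparison. The sketch below reuses the same key on made-up tasks to show the resulting order. The task contents are illustrative, and it assumes, as the negated key implies, that a larger `priority` number means more urgent.

```python
tasks = [
    {"content": "Low priority", "priority": 1, "is_overdue": False, "task_age_days": 0},
    {"content": "Old, urgent", "priority": 4, "is_overdue": False, "task_age_days": 9},
    {"content": "Overdue task", "priority": 2, "is_overdue": True, "task_age_days": 3},
    {"content": "Overdue, urgent", "priority": 4, "is_overdue": True, "task_age_days": 1},
]


def sort_key(t: dict) -> tuple:
    # False sorts before True, so "not is_overdue" puts overdue tasks first;
    # negating priority and age turns the ascending sort into descending order.
    return (
        not t.get("is_overdue", False),
        -t.get("priority", 1),
        -t.get("task_age_days", 0.0),
    )


ordered = [t["content"] for t in sorted(tasks, key=sort_key)]
```

Here `ordered` lists both overdue tasks first (the urgent one leading), followed by the non-overdue tasks by descending priority.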
--- a/ml/serving/requirements.txt
+++ b/ml/serving/requirements.txt
@@ -4,3 +4,8 @@ pydantic==2.10.4
 numpy>=1.26.0
 httpx>=0.27.0
 anthropic>=0.40.0
 nats-py>=2.9.0
 structlog>=24.1.0
 sentry-sdk>=2.0.0
 mlflow-skinny>=3.1.0
 pyswisseph>=2.10.3.2
--- a/ml/serving/schemas.py
+++ b/ml/serving/schemas.py
@@ -0,0 +1,50 @@
 """
 Pydantic models mirroring oo.events.v1 proto schemas.
 Field names use camelCase to match the proto3 JSON mapping convention
 and the TypeScript payload shapes published by services/api.
 Keep in sync with packages/shared-types/events/oo/events/v1/.
 """
 from __future__ import annotations
 from typing import Literal, Optional
 from pydantic import BaseModel
 class TaskSyncedPayload(BaseModel):
    userId: str
    source: str
    count: int
    syncedAt: str
 class TipServedPayload(BaseModel):
    userId: str
    tipId: str
    policy: str
    servedAt: str
 class TipFeedbackPayload(BaseModel):
    userId: str
    tipId: str
    action: Literal['done', 'dismiss', 'snooze', 'helpful', 'not_helpful']
    reward: float
    dwellMs: Optional[int] = None
    createdAt: str
 class TipRewardFailedPayload(BaseModel):
    userId: str
    tipId: str
    reward: float
    attempts: int
    error: str
    failedAt: str
 class IntegrationTokenExpiredPayload(BaseModel):
    userId: str
    provider: str
    detectedAt: str
--- a/ml/serving/tests/test_generate.py
+++ b/ml/serving/tests/test_generate.py
@@ -127,6 +127,46 @@ def test_build_prompt_empty_tasks_no_task_line():
    assert "Generate 2 tips" in prompt
 def test_build_prompt_tasks_sorted_overdue_first():
    tasks = [
        {"content": "Low priority", "priority": 1, "is_overdue": False, "task_age_days": 0},
        {"content": "Overdue task", "priority": 2, "is_overdue": True, "task_age_days": 3},
    ]
    ctx = PromptContext(tasks=tasks, hour_of_day=9)
    prompt = _build_user_v1(ctx, n=2)
    assert prompt.index("Overdue task") < prompt.index("Low priority")
 def test_build_prompt_includes_profile_features():
    ctx = PromptContext(
        tasks=[],
        hour_of_day=14,
        profile_features={"completion_rate_30d": 0.75, "dismiss_rate_30d": 0.1, "preferred_hour": 9},
    )
    prompt = _build_user_v1(ctx, n=1)
    assert "User profile:" in prompt
    assert "completion_rate=75%" in prompt
    assert "dismiss_rate=10%" in prompt
    assert "preferred_hour=09:00" in prompt
 def test_build_prompt_no_profile_line_when_empty():
    ctx = PromptContext(tasks=[], hour_of_day=10, profile_features={})
    prompt = _build_user_v1(ctx, n=1)
    assert "User profile:" not in prompt
 def test_build_prompt_profile_partial_fields():
    ctx = PromptContext(
        tasks=[],
        hour_of_day=10,
        profile_features={"completion_rate_30d": 0.5},
    )
    prompt = _build_user_v1(ctx, n=1)
    assert "completion_rate=50%" in prompt
    assert "dismiss_rate" not in prompt
@pytest.mark.anyio
 async def test_generate_retry_succeeds_on_second_attempt():
    """First response is invalid JSON; second is valid. Should return 200."""
@@ -271,6 +311,38 @@ async def test_generate_echoes_selected_prompt_version():
    assert resp.json()["prompt_version"] == "v2-mentor"
@pytest.mark.anyio
 async def test_generate_passes_profile_features_to_prompt():
    """profile_features from GenerateRequest should appear in the user message sent to LiteLLM."""
    fake_items = [{"id": "tip-1", "content": "x", "rationale": "y"}]
    mock_resp = _litellm_response(fake_items)
    captured_payload: list[dict] = []
    async def _capture(url, *, json, headers):
        captured_payload.append(json)
        return mock_resp
    with patch("main.httpx.AsyncClient") as MockClient:
        instance = AsyncMock()
        instance.post = AsyncMock(side_effect=_capture)
        instance.__aenter__ = AsyncMock(return_value=instance)
        instance.__aexit__ = AsyncMock(return_value=False)
        MockClient.return_value = instance
        async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
            resp = await client.post("/generate", json={
                "user_id": "u1",
                "n": 1,
                "profile_features": {"completion_rate_30d": 0.8, "preferred_hour": 10},
            })
    assert resp.status_code == 200
    user_msg = captured_payload[0]["messages"][1]["content"]
    assert "User profile:" in user_msg
    assert "completion_rate=80%" in user_msg
    assert "preferred_hour=10:00" in user_msg
@pytest.mark.anyio
 async def test_generate_422_on_unknown_prompt_version():
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
--- a/ml/serving/tests/test_infer_endpoint.py
+++ b/ml/serving/tests/test_infer_endpoint.py
@@ -0,0 +1,52 @@
 """POST /agents/{agent_id}/infer — inference framework endpoint."""
 import pytest
 from httpx import AsyncClient, ASGITransport
 from main import app
@pytest.mark.anyio
 async def test_infer_time_of_day_cold_start():
    """Fewer than min_history events → cold_start_default for preferred_hour."""
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        resp = await client.post("/agents/time-of-day/infer", json={
            "user_id": "u1",
            "feedback_history": [
                {"action": "done", "dwell_ms": 60000, "created_at": "2026-05-01T09:00:00+00:00"},
            ] * 5,  # 5 < min_history=10
        })
    assert resp.status_code == 200
    body = resp.json()
    assert body["agent_id"] == "time-of-day"
    assert body["inferred_prefs"]["preferred_hour"] is None
@pytest.mark.anyio
 async def test_infer_time_of_day_enough_history():
    """10+ events → preferred_hour is inferred as the mode done-hour."""
    events = [{"action": "done", "dwell_ms": 60000, "created_at": "2026-05-01T09:00:00+00:00"}] * 10
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        resp = await client.post("/agents/time-of-day/infer", json={"user_id": "u1", "feedback_history": events})
    assert resp.status_code == 200
    body = resp.json()
    assert body["inferred_prefs"]["preferred_hour"] == 9
@pytest.mark.anyio
 async def test_infer_agent_with_no_inferred_params():
    """Agents with no inferred_params return an empty dict (focus-area has none)."""
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        resp = await client.post("/agents/focus-area/infer", json={"user_id": "u1", "feedback_history": []})
    assert resp.status_code == 200
    assert resp.json()["inferred_prefs"] == {}
@pytest.mark.anyio
 async def test_infer_unknown_agent_404():
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        resp = await client.post("/agents/ghost/infer", json={"user_id": "u1", "feedback_history": []})
    assert resp.status_code == 404
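Review note: the two time-of-day tests pin down a cold-start threshold and a mode-based inference. The endpoint's implementation is not shown in this diff, so the following is only a sketch of one function that would satisfy them. `infer_preferred_hour` is a hypothetical name, and reading the hour straight from the timestamp without timezone conversion is an assumption.

```python
from collections import Counter
from datetime import datetime


def infer_preferred_hour(feedback_history: list[dict], min_history: int = 10):
    """Return the most common hour of 'done' events, or None on cold start."""
    if len(feedback_history) < min_history:
        return None  # cold start: not enough events to infer anything
    done_hours = [
        datetime.fromisoformat(e["created_at"]).hour
        for e in feedback_history
        if e.get("action") == "done"
    ]
    if not done_hours:
        return None
    # most_common(1) yields [(hour, count)] for the modal hour
    return Counter(done_hours).most_common(1)[0][0]


event = {"action": "done", "dwell_ms": 60000, "created_at": "2026-05-01T09:00:00+00:00"}
cold = infer_preferred_hour([event] * 5)
warm = infer_preferred_hour([event] * 10)
```

With the fixtures from the tests, five events stay below the threshold while ten events yield hour 9.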
--- a/ml/serving/tests/test_registry_endpoint.py
+++ b/ml/serving/tests/test_registry_endpoint.py
@@ -0,0 +1,21 @@
 """GET /agents/registry — manifests are exposed in JSON-serialisable form."""
 import pytest
 from httpx import AsyncClient, ASGITransport
 from main import app
@pytest.mark.anyio
 async def test_registry_returns_all_agents():
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        resp = await client.get("/agents/registry")
    assert resp.status_code == 200
    payload = resp.json()
    ids = {a["id"] for a in payload["agents"]}
    assert ids == {"overdue-task", "momentum", "time-of-day", "recent-patterns", "focus-area"}
    sample = payload["agents"][0]
    for key in ("id", "version", "description", "pref_schema", "required_consents", "ttl_sec"):
        assert key in sample
--- a/ml/serving/tests/test_schemas_and_consumer.py
+++ b/ml/serving/tests/test_schemas_and_consumer.py
@@ -0,0 +1,169 @@
 """
 Tests for schemas.py and nats_consumer._handle.
 """
 import json
 import pytest
 import tempfile
 from pathlib import Path
 from pydantic import ValidationError
 from unittest.mock import AsyncMock
 from schemas import (
    TaskSyncedPayload,
    TipServedPayload,
    TipFeedbackPayload,
    TipRewardFailedPayload,
    IntegrationTokenExpiredPayload,
 )
 from nats_consumer import _handle, _sync_meta_path
 # ── Schema validation ─────────────────────────────────────────────────────────
 class TestTaskSyncedPayload:
    def test_valid(self):
        p = TaskSyncedPayload.model_validate(
            {"userId": "u1", "source": "todoist", "count": 5, "syncedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.userId == "u1"
        assert p.count == 5
    def test_missing_field_raises(self):
        with pytest.raises(ValidationError):
            TaskSyncedPayload.model_validate({"userId": "u1", "source": "todoist"})
    def test_wrong_type_raises(self):
        with pytest.raises(ValidationError):
            TaskSyncedPayload.model_validate(
                {"userId": "u1", "source": "todoist", "count": "not-an-int", "syncedAt": "2026-04-25T10:00:00Z"}
            )
 class TestTipFeedbackPayload:
    def test_valid_without_dwell(self):
        p = TipFeedbackPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "action": "done", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
        )
        assert p.dwellMs is None
    def test_valid_with_dwell(self):
        p = TipFeedbackPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "action": "helpful", "reward": 0.5,
             "dwellMs": 3200, "createdAt": "2026-04-25T10:00:00Z"}
        )
        assert p.dwellMs == 3200
    def test_invalid_action_raises(self):
        with pytest.raises(ValidationError):
            TipFeedbackPayload.model_validate(
                {"userId": "u1", "tipId": "t1", "action": "like", "reward": 1.0, "createdAt": "2026-04-25T10:00:00Z"}
            )
    def test_all_valid_actions(self):
        for action in ("done", "dismiss", "snooze", "helpful", "not_helpful"):
            p = TipFeedbackPayload.model_validate(
                {"userId": "u1", "tipId": "t1", "action": action, "reward": 0.0, "createdAt": "2026-04-25T10:00:00Z"}
            )
            assert p.action == action
 class TestOtherPayloads:
    def test_tip_served(self):
        p = TipServedPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "policy": "egreedy-v2", "servedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.policy == "egreedy-v2"
    def test_tip_reward_failed(self):
        p = TipRewardFailedPayload.model_validate(
            {"userId": "u1", "tipId": "t1", "reward": 1.0, "attempts": 3,
             "error": "timeout", "failedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.attempts == 3
    def test_integration_token_expired(self):
        p = IntegrationTokenExpiredPayload.model_validate(
            {"userId": "u1", "provider": "todoist", "detectedAt": "2026-04-25T10:00:00Z"}
        )
        assert p.provider == "todoist"
 # ── _handle behaviour ─────────────────────────────────────────────────────────
 TASK_SYNCED = {
    "userId": "user-abc",
    "source": "todoist",
    "count": 7,
    "syncedAt": "2026-04-25T10:00:00Z",
 }
 TIP_FEEDBACK = {
    "userId": "user-abc",
    "tipId": "tip-xyz",
    "action": "done",
    "reward": 1.0,
    "dwellMs": 4200,
    "createdAt": "2026-04-25T10:00:00Z",
 }
 class TestHandle:
    @pytest.mark.asyncio
    async def test_task_synced_writes_meta_file(self):
        with tempfile.TemporaryDirectory() as tmp:
            state_dir = Path(tmp)
            await _handle("signals.task.synced", TASK_SYNCED, state_dir)
            meta_path = _sync_meta_path(state_dir, "user-abc")
            assert meta_path.exists()
            data = json.loads(meta_path.read_text())
            assert data["task_count"] == 7
            assert data["last_sync_ts"] == "2026-04-25T10:00:00Z"
    @pytest.mark.asyncio
    async def test_task_synced_bad_payload_raises(self):
        with tempfile.TemporaryDirectory() as tmp:
            with pytest.raises(ValidationError):
                await _handle("signals.task.synced", {"userId": "u1"}, Path(tmp))
    @pytest.mark.asyncio
    async def test_tip_feedback_valid_does_not_raise(self):
        with tempfile.TemporaryDirectory() as tmp:
            # should log and return cleanly
            await _handle("signals.tip.feedback", TIP_FEEDBACK, Path(tmp))
    @pytest.mark.asyncio
    async def test_tip_feedback_bad_action_raises(self):
        bad = {**TIP_FEEDBACK, "action": "unknown"}
        with tempfile.TemporaryDirectory() as tmp:
            with pytest.raises(ValidationError):
                await _handle("signals.tip.feedback", bad, Path(tmp))
    @pytest.mark.asyncio
    async def test_unhandled_subject_is_ignored(self):
        with tempfile.TemporaryDirectory() as tmp:
            # should not raise for unknown subjects
            await _handle("signals.something.new", {"any": "data"}, Path(tmp))
    @pytest.mark.asyncio
    async def test_make_handler_acks_on_success(self):
        from nats_consumer import _make_handler
        with tempfile.TemporaryDirectory() as tmp:
            handler = _make_handler("signals", Path(tmp))
            msg = AsyncMock()
            msg.subject = "signals.task.synced"
            msg.data = json.dumps(TASK_SYNCED).encode()
            await handler(msg)
            msg.ack.assert_awaited_once()
            msg.nak.assert_not_awaited()
    @pytest.mark.asyncio
    async def test_make_handler_naks_on_validation_error(self):
        from nats_consumer import _make_handler
        with tempfile.TemporaryDirectory() as tmp:
            handler = _make_handler("signals", Path(tmp))
            msg = AsyncMock()
            msg.subject = "signals.task.synced"
            msg.data = json.dumps({"userId": "u1"}).encode()  # missing fields
            await handler(msg)
            msg.nak.assert_awaited_once()
            msg.ack.assert_not_awaited()
--- a/ml/serving/tests/test_score.py
+++ b/ml/serving/tests/test_score.py
@@ -1,439 +0,0 @@
 """
 Unit tests for ml/serving — feature building and scoring contract.
 Run with: pytest ml/serving/tests/
 """
 import math
 import pytest
 from httpx import AsyncClient, ASGITransport
 from main import (
    app,
    build_feature_vector,
    build_feature_vector_12,
    _norm_dwell,
    _norm_preferred_hour,
    _norm_rate,
    _norm_volume,
 )
 class TestFeatureVector:
    def test_shape(self):
        v = build_feature_vector({"hour_of_day": 8, "is_overdue": True, "task_age_days": 3, "priority": 3})
        assert v.shape == (5,)
    def test_hour_encoding_noon(self):
        v = build_feature_vector({"hour_of_day": 12})
        # sin(2π * 12/24) = sin(π) ≈ 0
        assert abs(v[0]) < 1e-10
        # cos(2π * 12/24) = cos(π) = -1
        assert abs(v[1] - (-1.0)) < 1e-10
    def test_hour_encoding_midnight(self):
        v = build_feature_vector({"hour_of_day": 0})
        # sin(0) = 0
        assert abs(v[0]) < 1e-10
        # cos(0) = 1
        assert abs(v[1] - 1.0) < 1e-10
    def test_hour_encoding_6am(self):
        v = build_feature_vector({"hour_of_day": 6})
        # sin(2π * 6/24) = sin(π/2) = 1
        assert abs(v[0] - 1.0) < 1e-10
        # cos(π/2) = 0
        assert abs(v[1]) < 1e-10
    def test_age_clipped_at_30(self):
        v_long = build_feature_vector({"task_age_days": 100})
        v_cap = build_feature_vector({"task_age_days": 30})
        assert v_long[3] == v_cap[3] == 1.0
    def test_age_zero(self):
        v = build_feature_vector({"task_age_days": 0})
        assert v[3] == pytest.approx(0.0)
    def test_age_15_days_normalised(self):
        v = build_feature_vector({"task_age_days": 15})
        assert v[3] == pytest.approx(0.5)
    def test_priority_normalised(self):
        v1 = build_feature_vector({"priority": 1})
        v4 = build_feature_vector({"priority": 4})
        assert v1[4] == pytest.approx(0.0)
        assert v4[4] == pytest.approx(1.0)
    def test_priority_2_and_3(self):
        v2 = build_feature_vector({"priority": 2})
        v3 = build_feature_vector({"priority": 3})
        assert v2[4] == pytest.approx(1 / 3)
        assert v3[4] == pytest.approx(2 / 3)
    def test_is_overdue_true(self):
        v = build_feature_vector({"is_overdue": True})
        assert v[2] == 1.0
    def test_is_overdue_false(self):
        v = build_feature_vector({"is_overdue": False})
        assert v[2] == 0.0
    def test_defaults_when_no_keys(self):
        v = build_feature_vector({})
        # hour=12 → sin(π)≈0, cos(π)=-1
        assert abs(v[0]) < 1e-10
        assert abs(v[1] - (-1.0)) < 1e-10
        assert v[2] == 0.0   # is_overdue=False
        assert v[3] == 0.0   # task_age_days=0
        assert v[4] == 0.0   # priority=1 → (1-1)/3=0
@pytest.mark.asyncio
 async def test_health():
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r = await client.get("/health")
    assert r.status_code == 200
    assert r.json()["ok"] is True
@pytest.mark.asyncio
 async def test_score_returns_a_candidate():
    payload = {
        "user_id": "test-user",
        "candidates": [
            {"id": "t:1", "content": "Task A", "source": "todoist", "source_id": "1",
             "features": {"is_overdue": True, "task_age_days": 2, "priority": 3}},
            {"id": "t:2", "content": "Task B", "source": "todoist", "source_id": "2",
             "features": {"is_overdue": False, "task_age_days": 0, "priority": 1}},
        ],
        "context": {"hour_of_day": 9, "day_of_week": 1},
    }
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r = await client.post("/score", json=payload)
    assert r.status_code == 200
    body = r.json()
    assert body["tip_id"] in {"t:1", "t:2"}
    assert "policy" in body
    assert body["policy"] == "linucb-v1"
    assert isinstance(body["score"], float)
@pytest.mark.asyncio
 async def test_score_single_candidate_always_selected():
    """With a single candidate there is no choice — it must be returned."""
    payload = {
        "user_id": "solo-user",
        "candidates": [
            {"id": "only:1", "content": "Only task", "source": "todoist",
             "features": {"is_overdue": False, "task_age_days": 0, "priority": 1}},
        ],
        "context": {"hour_of_day": 10, "day_of_week": 0},
    }
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r = await client.post("/score", json=payload)
    assert r.status_code == 200
    assert r.json()["tip_id"] == "only:1"
@pytest.mark.asyncio
 async def test_score_empty_candidates_returns_422():
    payload = {"user_id": "u", "candidates": [], "context": {"hour_of_day": 9, "day_of_week": 1}}
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r = await client.post("/score", json=payload)
    assert r.status_code == 422
@pytest.mark.asyncio
 async def test_reward_accepted():
    payload = {
        "user_id": "reward-user",
        "tip_id": "t:1",
        "reward": 1.0,
        "features": {"hour_of_day": 9, "is_overdue": True, "task_age_days": 2, "priority": 3},
    }
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r = await client.post("/reward", json=payload)
    assert r.status_code == 200
    assert r.json()["ok"] is True
@pytest.mark.asyncio
 async def test_reward_updates_stats():
    """Posting a reward should increase cumulative_reward in /stats."""
    user_id = "reward-stats-user"
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r0 = await client.get(f"/stats/{user_id}")
        before = r0.json()["cumulative_reward"]
        await client.post("/reward", json={
            "user_id": user_id,
            "tip_id": "tip:x",
            "reward": 1.0,
            "features": {"hour_of_day": 8, "is_overdue": False, "task_age_days": 0, "priority": 2},
        })
        r1 = await client.get(f"/stats/{user_id}")
    assert r1.json()["cumulative_reward"] == pytest.approx(before + 1.0)
@pytest.mark.asyncio
 async def test_score_increments_pulls():
    user_id = "pull-counter-user"
    payload = {
        "user_id": user_id,
        "candidates": [
            {"id": "t:p1", "content": "Pull task", "source": "todoist",
             "features": {"is_overdue": False, "task_age_days": 1, "priority": 2}},
        ],
        "context": {"hour_of_day": 10, "day_of_week": 2},
    }
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r0 = await client.get(f"/stats/{user_id}")
        pulls_before = r0.json()["pulls"]
        await client.post("/score", json=payload)
        await client.post("/score", json=payload)
        r1 = await client.get(f"/stats/{user_id}")
    assert r1.json()["pulls"] == pulls_before + 2
@pytest.mark.asyncio
 async def test_reset_clears_state():
    user_id = "reset-user"
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        # Score once to build state
        await client.post("/score", json={
            "user_id": user_id,
            "candidates": [
                {"id": "t:r", "content": "Reset task", "source": "todoist",
                 "features": {"is_overdue": True, "task_age_days": 5, "priority": 4}},
            ],
            "context": {"hour_of_day": 14, "day_of_week": 3},
        })
        r_reset = await client.post(f"/reset/{user_id}")
        assert r_reset.json()["ok"] is True
        r_stats = await client.get(f"/stats/{user_id}")
    assert r_stats.json()["pulls"] == 0
@pytest.mark.asyncio
 async def test_features_endpoint_returns_history():
    user_id = "features-user"
    payload = {
        "user_id": user_id,
        "candidates": [
            {"id": "t:f1", "content": "Feature task", "source": "todoist",
             "features": {"is_overdue": False, "task_age_days": 0, "priority": 1}},
        ],
        "context": {"hour_of_day": 7, "day_of_week": 0},
    }
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        await client.post("/score", json=payload)
        r = await client.get(f"/features/{user_id}")
    body = r.json()
    assert r.status_code == 200
    assert "history" in body
    assert len(body["history"]) >= 1
    entry = body["history"][-1]
    assert "ts" in entry
    assert "score" in entry
    assert "tip_id" in entry
@pytest.mark.asyncio
 async def test_stats_for_fresh_user():
    """A user with no history should return zero/default stats without error."""
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r = await client.get("/stats/brand-new-user-xyz-abc")
    body = r.json()
    assert r.status_code == 200
    assert body["pulls"] == 0
    assert body["cumulative_reward"] == 0.0
    assert body["estimated_mean_reward"] == 0.0
 class TestV2Normalization:
    def test_rate_passthrough(self):
        assert _norm_rate(0.0) == 0.0
        assert _norm_rate(0.42) == 0.42
        assert _norm_rate(1.0) == 1.0
    def test_rate_none_zero(self):
        assert _norm_rate(None) == 0.0
    def test_rate_clipped(self):
        assert _norm_rate(1.5) == 1.0
        assert _norm_rate(-0.1) == 0.0
    def test_dwell_none_zero(self):
        assert _norm_dwell(None) == 0.0
    def test_dwell_scales_to_0_1(self):
        assert _norm_dwell(0) == 0.0
        # 600_000 ms (10 min) is the clip ceiling
        assert _norm_dwell(600_000) == 1.0
        assert _norm_dwell(1_200_000) == 1.0
        assert _norm_dwell(60_000) == pytest.approx(0.1)
    def test_volume_monotonic_and_clipped(self):
        assert _norm_volume(None) == 0.0
        assert _norm_volume(0) == 0.0
        assert _norm_volume(10) < _norm_volume(100)
        # 100 tips ≈ full saturation
        assert _norm_volume(100) == pytest.approx(1.0)
        assert _norm_volume(10_000) == 1.0
    def test_preferred_hour_alignment(self):
        # Exact match → 1.0
        assert _norm_preferred_hour(9, 9) == pytest.approx(1.0)
        # 12h opposite → 0.0
        assert _norm_preferred_hour(21, 9) == pytest.approx(0.0, abs=1e-10)
        # 6h off → 0.5 (cos(π/2) = 0, scaled to 0.5)
        assert _norm_preferred_hour(15, 9) == pytest.approx(0.5, abs=1e-10)
    def test_preferred_hour_null_neutral(self):
        # Null preference → neutral 0.5 rather than misleading "alignment at 0"
        assert _norm_preferred_hour(None, 9) == 0.5
 class TestFeatureVector12:
    def test_shape(self):
        v = build_feature_vector_12(
            {"hour_of_day": 9, "is_overdue": True, "task_age_days": 2, "priority": 3},
            day_of_week=2,
            profile={
                "completion_rate_30d": 0.5,
                "dismiss_rate_30d": 0.1,
                "mean_dwell_ms_30d": 60_000,
                "preferred_hour": 9,
                "tip_volume_30d": 20,
            },
        )
        assert v.shape == (12,)
    def test_first_seven_match_v1(self):
        """v2 must reduce to v1-style features on the first 7 dims so rollout
        behaviour is predictable when profile is absent."""
        from main import build_feature_vector_7
        feat = {"hour_of_day": 14, "is_overdue": True, "task_age_days": 5, "priority": 2}
        v1 = build_feature_vector_7(feat, day_of_week=3)
        v2 = build_feature_vector_12(feat, day_of_week=3, profile=None)
        assert (v1 == v2[:7]).all()
    def test_missing_profile_defaults(self):
        v = build_feature_vector_12({"hour_of_day": 9}, day_of_week=0, profile=None)
        # completion, dismiss, dwell, volume → 0; preferred_hour → 0.5 neutral
        assert v[7] == 0.0
        assert v[8] == 0.0
        assert v[9] == 0.0
        assert v[10] == pytest.approx(0.5)
        assert v[11] == 0.0
@pytest.mark.asyncio
 async def test_score_egreedy_v2_returns_candidate():
    payload = {
        "user_id": "v2-user",
        "candidates": [
            {"id": "t:a", "content": "A", "source": "todoist",
             "features": {"is_overdue": True, "task_age_days": 2, "priority": 3}},
            {"id": "t:b", "content": "B", "source": "todoist",
             "features": {"is_overdue": False, "task_age_days": 0, "priority": 1}},
        ],
        "context": {"hour_of_day": 9, "day_of_week": 1},
        "profile_features": {
            "completion_rate_30d": 0.4,
            "dismiss_rate_30d": 0.1,
            "mean_dwell_ms_30d": 45_000,
            "preferred_hour": 9,
            "tip_volume_30d": 8,
        },
    }
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r = await client.post("/score/egreedy/v2", json=payload)
    assert r.status_code == 200
    body = r.json()
    assert body["tip_id"] in {"t:a", "t:b"}
    assert body["policy"] == "egreedy-v2"
@pytest.mark.asyncio
 async def test_score_egreedy_v2_accepts_missing_profile():
    payload = {
        "user_id": "v2-no-profile",
        "candidates": [
            {"id": "t:solo", "content": "Solo", "source": "todoist",
             "features": {"is_overdue": False, "task_age_days": 0, "priority": 1}},
        ],
        "context": {"hour_of_day": 10, "day_of_week": 0},
    }
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r = await client.post("/score/egreedy/v2", json=payload)
    assert r.status_code == 200
    assert r.json()["tip_id"] == "t:solo"
@pytest.mark.asyncio
 async def test_reward_egreedy_v2_updates_stats():
    user_id = "v2-reward-stats"
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r0 = await client.get(f"/stats/egreedy/v2/{user_id}")
        before = r0.json()["cumulative_reward"]
        await client.post("/reward/egreedy/v2", json={
            "user_id": user_id,
            "tip_id": "t:r",
            "reward": 1.0,
            "features": {"hour_of_day": 9, "is_overdue": True, "task_age_days": 2, "priority": 3},
            "day_of_week": 1,
            "profile_features": {
                "completion_rate_30d": 0.3,
                "dismiss_rate_30d": 0.2,
                "mean_dwell_ms_30d": 30_000,
                "preferred_hour": 9,
                "tip_volume_30d": 5,
            },
        })
        r1 = await client.get(f"/stats/egreedy/v2/{user_id}")
    body = r1.json()
    assert body["cumulative_reward"] == pytest.approx(before + 1.0)
    assert body["policy"] == "egreedy-v2"
    assert len(body["theta"]) == 12
    assert len(body["feature_labels"]) == 12
@pytest.mark.asyncio
 async def test_reset_clears_v2_state():
    user_id = "v2-reset"
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        await client.post("/score/egreedy/v2", json={
            "user_id": user_id,
            "candidates": [
                {"id": "t:v2r", "content": "x", "source": "todoist",
                 "features": {"is_overdue": False, "task_age_days": 0, "priority": 1}},
            ],
            "context": {"hour_of_day": 10, "day_of_week": 0},
        })
        r0 = await client.get(f"/stats/egreedy/v2/{user_id}")
        assert r0.json()["pulls"] >= 1
        await client.post(f"/reset/{user_id}")
        r1 = await client.get(f"/stats/egreedy/v2/{user_id}")
    assert r1.json()["pulls"] == 0
@pytest.mark.asyncio
 async def test_reward_negative_value():
    """Dismissing a tip should decrease cumulative_reward."""
    user_id = "dismiss-user-neg"
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
        r0 = await client.get(f"/stats/{user_id}")
        before = r0.json()["cumulative_reward"]
        await client.post("/reward", json={
            "user_id": user_id,
            "tip_id": "t:neg",
            "reward": -1.0,
            "features": {"hour_of_day": 20, "is_overdue": False, "task_age_days": 0, "priority": 1},
        })
        r1 = await client.get(f"/stats/{user_id}")
    assert r1.json()["cumulative_reward"] == pytest.approx(before - 1.0)
--- a/packages/shared-types/README.md
+++ b/packages/shared-types/README.md
@@ -0,0 +1,63 @@
 # @oo/shared-types
 Canonical contracts for all inter-module communication. Two surfaces:
 | Surface | Format | Location |
 |---------|--------|----------|
 | HTTP (sync) | OpenAPI / TypeScript interfaces | `src/http/` |
 | Events (async) | Protocol Buffers + TS interfaces | `src/events/`, `events/` |
 ## HTTP types
 Hand-written TypeScript interfaces that mirror the OpenAPI specs. Imported by
 `services/api`, `apps/web`, and `ml/serving` (Python hand-mirrors).
 | File | Types |
 |------|-------|
 | `src/http/tip.ts` | `TipCandidate`, `RecommendResponse`, `TipFeedback` |
 | `src/http/auth.ts` | `SessionUser` |
 | `src/http/integrations.ts` | `IntegrationsResponse`, `Integration` |
 | `src/http/user.ts` | `UserProfile` |
 | `src/http/signal.ts` | `Signal`, `SignalSource` |
 ## Event types
 Protobuf schemas live in `events/oo/events/v1/`. TypeScript interfaces in
 `src/events/index.ts` mirror the proto envelope and payload types.
 | Proto file | Messages |
 |------------|----------|
 | `envelope.proto` | `Envelope` (wraps every event) |
 | `signals.proto` | `TaskSyncedPayload`, `TipServedPayload`, `TipFeedbackPayload`, `TipRewardFailedPayload` |
 | `integration.proto` | `IntegrationTokenExpiredPayload` |
 **Schema evolution rules (ADR-0005):**
 - Additive changes only within a version (new fields, new message types).
 - Removed fields must be marked `reserved` — never reuse a field number.
 - Breaking changes require a new package version (`oo.events.v2`) and a `schemaVersion` bump in the envelope.
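
A minimal sketch of how a Python consumer (for example a hand-mirror in `ml/serving`) might apply the `schemaVersion` rule above. The key names follow the `NormalizedEvent` interface in `src/events/index.ts`; the helper names `make_event` and `parse_event` are hypothetical and not part of this package.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumption: only "v1" envelopes exist today, per the rules above.
SUPPORTED_SCHEMA_VERSIONS = {"v1"}
REQUIRED_KEYS = {
    "eventId", "subject", "occurredAt", "schemaVersion",
    "producer", "seq", "payload",
}


def make_event(subject, producer, seq, payload):
    """Build an envelope dict using the camelCase keys of NormalizedEvent."""
    return {
        "eventId": str(uuid.uuid4()),
        "subject": subject,
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "schemaVersion": "v1",
        "producer": producer,
        "seq": seq,
        "payload": payload,
    }


def parse_event(raw):
    """Decode a JSON envelope; reject missing keys and unknown versions."""
    event = json.loads(raw)
    if not isinstance(event, dict):
        raise ValueError("envelope must be a JSON object")
    missing = REQUIRED_KEYS - set(event)
    if missing:
        raise ValueError(f"envelope missing keys: {sorted(missing)}")
    if event["schemaVersion"] not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schemaVersion: {event['schemaVersion']}")
    return event


if __name__ == "__main__":
    raw = json.dumps(make_event("signals.task.synced", "services/api", 1, {"userId": "u1"}))
    print(parse_event(raw)["subject"])
```

Rejecting unknown versions up front means a future `oo.events.v2` payload fails loudly instead of being misread as v1.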
 ## Schema registry / CI gate
 `buf` enforces lint and breaking-change detection on every PR that touches `events/`:
 ```bash
 # Lint
 buf lint events/
 # Breaking-change check against main
 buf breaking events/ --against '.git#branch=main,subdir=packages/shared-types/events'
 ```
 Local shortcut: `./scripts/buf-check.sh`
 CI: `.gitea/workflows/buf-check.yaml` (requires a Gitea Actions runner).
 Install buf: `curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf`
 ## Contract
 `/health` — not applicable (library package, no process).
 **Extraction criteria** — always a shared library. Extract to a separate registry
 service only when schema governance requires independent versioning and deployment
 (e.g. external consumers, SLA divergence from the monorepo).
--- a/packages/shared-types/events/buf.yaml
+++ b/packages/shared-types/events/buf.yaml
@@ -0,0 +1,7 @@
 version: v1
 lint:
  use:
    - STANDARD
 breaking:
  use:
    - FILE
--- a/packages/shared-types/events/oo/events/v1/envelope.proto
+++ b/packages/shared-types/events/oo/events/v1/envelope.proto
@@ -0,0 +1,25 @@
 syntax = "proto3";
 package oo.events.v1;
 import "oo/events/v1/signals.proto";
 import "oo/events/v1/integration.proto";
 // Envelope wraps every event on the bus and on NATS JetStream.
 // Wire format: proto3 JSON (camelCase field names).
 // schema_version = "v1" — bump to "v2" only for breaking payload changes.
 message Envelope {
  string event_id       = 1;  // UUID assigned by bus on publish
  string occurred_at    = 2;  // ISO 8601
  string schema_version = 3;  // "v1"
  string producer       = 4;  // e.g. "services/api"
  string subject        = 5;  // NATS-style subject: domain.entity.verb
  uint64 seq            = 6;  // monotonic sequence from the bus ring
  oneof payload {
    TaskSyncedPayload              task_synced              = 10;
    TipServedPayload               tip_served               = 11;
    TipFeedbackPayload             tip_feedback             = 12;
    TipRewardFailedPayload         tip_reward_failed        = 13;
    IntegrationTokenExpiredPayload integration_token_expired = 14;
  }
 }
--- a/packages/shared-types/events/oo/events/v1/integration.proto
+++ b/packages/shared-types/events/oo/events/v1/integration.proto
@@ -0,0 +1,9 @@
 syntax = "proto3";
 package oo.events.v1;
 // subject: signals.integration.token_expired
 message IntegrationTokenExpiredPayload {
  string user_id     = 1;
  string provider    = 2;
  string detected_at = 3;  // ISO 8601
 }
--- a/packages/shared-types/events/oo/events/v1/signals.proto
+++ b/packages/shared-types/events/oo/events/v1/signals.proto
@@ -0,0 +1,39 @@
 syntax = "proto3";
 package oo.events.v1;
 // subject: signals.task.synced
 message TaskSyncedPayload {
  string user_id   = 1;
  string source    = 2;  // e.g. "todoist"
  int32  count     = 3;
  string synced_at = 4;  // ISO 8601
 }
 // subject: signals.tip.served
 message TipServedPayload {
  string user_id   = 1;
  string tip_id    = 2;
  string policy    = 3;
  string served_at = 4;  // ISO 8601
 }
 // subject: signals.tip.feedback
 // action: done | dismiss | snooze | helpful | not_helpful
 message TipFeedbackPayload {
  string         user_id    = 1;
  string         tip_id     = 2;
  string         action     = 3;
  double         reward     = 4;
  optional int64 dwell_ms   = 5;  // null when no dwell was recorded
  string         created_at = 6;  // ISO 8601
 }
 // subject: signals.tip.reward_failed
 message TipRewardFailedPayload {
  string user_id   = 1;
  string tip_id    = 2;
  double reward    = 3;
  int32  attempts  = 4;
  string error     = 5;
  string failed_at = 6;  // ISO 8601
 }
--- a/packages/shared-types/package.json
+++ b/packages/shared-types/package.json
@@ -15,7 +15,9 @@
    "test": "vitest run",
    "test:watch": "vitest",
    "type-check": "tsc --noEmit",
-    "clean": "rm -rf dist"
+    "clean": "rm -rf dist",
    "buf:lint": "buf lint events",
    "buf:breaking": "buf breaking events --against '.git#branch=main,subdir=packages/shared-types/events'"
  },
  "devDependencies": {
    "@vitest/coverage-v8": "^4.1.4",
--- a/packages/shared-types/src/events/index.ts
+++ b/packages/shared-types/src/events/index.ts
@@ -1,6 +1,6 @@
 /**
 * NormalizedEvent — the durable envelope for all events flowing through
- * the system. Today: in-process EventEmitter. Tomorrow: NATS JetStream.
+ * the system. Mirrors oo.events.v1.Envelope in packages/shared-types/events/.
 *
 * Subject taxonomy:
 *   signals.task.synced      — Todoist (or other source) task list refreshed
@@ -10,10 +10,16 @@
 *   signals.integration.token_expired — OAuth token needs reconnect
 */
 export interface NormalizedEvent<T = unknown> {
  /** UUID assigned by bus on publish */
  eventId: string;
  /** NATS-style subject: domain.entity.verb */
  subject: string;
  /** ISO 8601 timestamp */
-  ts: string;
+  occurredAt: string;
  /** "v1" — bump for breaking payload changes; see packages/shared-types/events/ */
  schemaVersion: 'v1';
  /** e.g. "services/api" */
  producer: string;
  /** Monotonically increasing sequence number (in-process ring; JetStream seq in prod) */
  seq: number;
  payload: T;
--- a/packages/shared-types/src/http/integrations.ts
+++ b/packages/shared-types/src/http/integrations.ts
@@ -1,4 +1,4 @@
-export type IntegrationProvider = 'todoist';
+export type IntegrationProvider = 'todoist' | 'google-health';
 export type IntegrationStatus = 'connected' | 'disconnected' | 'error';
 export interface Integration {
--- a/packages/shared-types/src/http/signal.ts
+++ b/packages/shared-types/src/http/signal.ts
@@ -2,7 +2,7 @@
 export interface Signal {
  id: string;
  source: string;                             // e.g. 'todoist', 'google-calendar', 'manual'
-  kind: 'task' | 'event' | 'habit' | 'insight';
+  kind: 'task' | 'event' | 'habit' | 'insight' | 'health';
  content: string;
  metadata: Record<string, unknown>;          // source-specific raw fields
  features: Record<string, number | boolean>; // bandit-ready numeric/boolean features
--- a/packages/shared-types/src/http/tip.ts
+++ b/packages/shared-types/src/http/tip.ts
@@ -2,7 +2,7 @@
 export type TipKind = 'task' | 'advice' | 'insight' | 'reminder';
 /** Where the tip content originated */
-export type TipSource = 'todoist' | 'llm' | 'advice';
+export type TipSource = 'todoist' | 'llm' | 'advice' | 'fallback';
 /** A single recommendation surfaced to the user */
 export interface Tip {
--- a/packages/shared-types/tsconfig.json
+++ b/packages/shared-types/tsconfig.json
@@ -4,5 +4,6 @@
    "outDir": "dist",
    "rootDir": "src"
  },
-  "include": ["src"]
+  "include": ["src"],
  "exclude": ["src/__tests__", "**/*.test.ts"]
 }
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
--- a/scripts/buf-check.sh
+++ b/scripts/buf-check.sh
@@ -0,0 +1,24 @@
 #!/usr/bin/env bash
 # Run buf lint and breaking-change detection locally.
 # Usage: ./scripts/buf-check.sh [against-branch]
 # Default against-branch: main
 set -euo pipefail
 AGAINST="${1:-main}"
 ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 EVENTS="$ROOT/packages/shared-types/events"
 if ! command -v buf &>/dev/null; then
  echo "buf not found. Install: https://buf.build/docs/installation"
  echo "  curl -sSfL https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64 -o /usr/local/bin/buf && chmod +x /usr/local/bin/buf"
  exit 1
 fi
 echo "==> buf lint"
 buf lint "$EVENTS"
 echo "==> buf breaking against $AGAINST"
 buf breaking "$EVENTS" \
  --against ".git#branch=${AGAINST},subdir=packages/shared-types/events"
 echo "All checks passed."
--- a/services/api/README.md
+++ b/services/api/README.md
@@ -0,0 +1,98 @@
 # services/api
 Express BFF that serves all client-facing routes, manages sessions, runs background signal sync, and proxies admin calls to `ml/serving`.
 ## Contract
 ```
 GET  /health                             → { ok: true }
 POST /api/auth/login                     → redirect to Google OAuth
 GET  /api/auth/callback                  OAuth return URL
 POST /api/auth/logout
 GET  /api/auth/session                   → { user? }
 POST /api/auth/token                     { token } → set sid cookie (ADMIN_TOKEN auth)
 GET  /api/integrations                   list connected integrations
 POST /api/integrations/todoist/connect   start Todoist OAuth
 GET  /api/integrations/todoist/callback
 DELETE /api/integrations/:provider       disconnect
 POST /api/recommend                      → { tip }
 POST /api/tip/:id/feedback               { action } → { ok }
 GET  /api/user/profile
 DELETE /api/user                         account deletion
 POST /api/push/subscribe
 DELETE /api/push/subscribe
 GET  /api/admin/stats                    DAU/WAU, feedback breakdown
 GET  /api/admin/users                    user list with pagination
 GET  /api/admin/users/:id                user detail, consents, integrations
 GET  /api/admin/events                   recent event stream (ring buffer or NATS JetStream)
 GET  /api/admin/events/history           historical event query (time range, filters)
 GET  /api/admin/sim/runs                 offline sim run list
 POST /api/admin/sim/run                  launch offline sim with policy/judge params
 GET  /api/admin/sim/runs/:id/output      tail sim stdout
 GET  /api/admin/features/:userId         per-user profile features + freshness
 GET  /api/admin/features/:userId/context context features for last score call
 GET  /api/admin/policies                 list shadow policies + active policy
 POST /api/admin/policies/:name/toggle    enable/disable shadow policy
 POST /api/admin/users/:id/actions        revoke-integration, reset-bandit, rebuild-profile
 GET  /api/admin/health                   system health: api, ml/serving, db, bus, mlflow
 GET  /api/admin/docs                     admin documentation index
 GET  /api/ml/*                           admin-only proxy to ml/serving
 ```
 ## Middleware stack (request order)
 1. `cors` — origin limited to `WEB_BASE_URL`
 2. `tracingMiddleware` — reads or generates W3C `traceparent`; sets `req.traceId` + `req.traceparent`
 3. `pinoHttp` — structured JSON request/response logs with `traceId` field; `/health` suppressed
 4. `express.json()` / `cookieParser`
 5. `sessionMiddleware` — validates `sid` cookie, attaches `req.userId`
 ## Observability
 Logs are structured JSON via **pino**. Every line includes `traceId` (extracted from the incoming W3C `traceparent` header, or generated fresh). The same `traceparent` is forwarded on all outbound HTTP calls to `ml/serving` so traces correlate end-to-end.
 Sentry error capture is active when `SENTRY_DSN` is set.
 ## Background tasks
 - **Todoist sync scheduler** — runs every `TODOIST_SYNC_INTERVAL_MS` (default 15 min); starts 10 s after boot to avoid startup surge.
 - **Retention purge** — deletes `tipScores` and `tipFeedback` rows older than 30 days; runs on boot and daily.
 - **Profile TTL invalidation** — listens to `signals.task.synced` and `signals.tip.feedback` on the in-process Bus; invalidates cached user-level profile features so the next `/recommend` gets fresh values.
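
As an illustration of the retention purge above, here is a minimal sketch using Python's `sqlite3`. The service itself is a Node process, so this only shows the shape of the query. The table names come from the list above; the `created_at_ms` column (epoch milliseconds) and the function name are assumptions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30
# Table names from the README; fixed constants, never user input.
TABLES = ("tipScores", "tipFeedback")


def purge_old_rows(conn, now=None):
    """Delete rows older than the retention window; return how many were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff_ms = int((now - timedelta(days=RETENTION_DAYS)).timestamp() * 1000)
    deleted = 0
    for table in TABLES:
        # Assumption: each table has an integer created_at_ms column.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE created_at_ms < ?", (cutoff_ms,)
        )
        deleted += cur.rowcount
    conn.commit()
    return deleted


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, created_at_ms INTEGER)")
    print(purge_old_rows(conn))
```

Running the same function once at boot and then on a daily timer matches the schedule described above.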
 ## Config
 | Env var | Default | Description |
 |---------|---------|-------------|
 | `PORT` | `3001` | Listen port |
 | `NODE_ENV` | `development` | Environment label |
 | `DATABASE_PATH` | `./data/oo.db` | SQLite file |
 | `SESSION_SECRET` | required | Cookie signing secret |
 | `GOOGLE_CLIENT_ID/SECRET` | required | OAuth |
 | `TODOIST_CLIENT_ID/SECRET` | required | OAuth |
 | `API_BASE_URL` | `http://localhost:3001` | Self-referential redirect URI |
 | `WEB_BASE_URL` | `http://localhost:3000` | CORS + post-login redirect |
 | `ML_SERVING_URL` | `http://localhost:8000` | ml/serving base URL |
 | `NATS_URL` | *(empty)* | NATS broker; empty = in-process bus only |
 | `TODOIST_SYNC_INTERVAL_MS` | `900000` | Background sync cadence |
 | `TIP_PROMPT_VERSION` | *(empty)* | Prompt variant(s) for `/generate` |
 | `LOG_LEVEL` | `info` | pino log level |
 | `SENTRY_DSN` | *(empty)* | Sentry DSN; empty = Sentry disabled |
 | `VAPID_*` | | Web push keys |
 | `ADMIN_TOKEN` | *(empty)* | Static token for service/Playwright admin auth; empty = disabled |
 ## Health story
 `GET /health` returns `{ ok: true }`. No dependency checks — upstream deps (`ml/serving`, NATS) have their own health endpoints checked separately.
 ## Extraction criteria
 Extract to its own host when:
 - Auth session management needs a dedicated Redis/PG session store, **or**
 - Background sync load (Todoist, future connectors) displaces API serving on the shared host, **or**
 - Team boundary emerges between auth/BFF and recommender orchestration.
--- a/services/api/package.json
+++ b/services/api/package.json
@@ -16,6 +16,7 @@
  },
  "dependencies": {
    "@oo/shared-types": "workspace:*",
    "@sentry/node": "^10.50.0",
    "better-sqlite3": "^11.8.1",
    "cookie-parser": "^1.4.7",
    "cors": "^2.8.5",
@@ -27,6 +28,8 @@
    "nats": "^2.29.3",
    "node-fetch": "^3.3.2",
    "openid-client": "^6.3.4",
    "pino": "^10.3.1",
    "pino-http": "^11.0.0",
    "web-push": "^3.6.7",
    "zod": "^3.24.1"
  },