docs(observability): add services/api README; update ml/serving + recommender docs (#18)

- services/api/README.md: new — contract, middleware stack, background
  tasks, config table (LOG_LEVEL, SENTRY_DSN), health story, extraction
  criteria
- ml/serving/README.md: add Observability section (structlog JSON,
  traceparent → trace_id binding), add SENTRY_DSN + ENV to config table
- services/recommender/README.md: fix policy table — egreedy-v2 is
  active (#99), egreedy-v1 is shadow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 03:41:39 +00:00
parent c4960d0601
commit b554970032
3 changed files with 100 additions and 3 deletions


@@ -47,6 +47,12 @@ On startup, `nats_consumer.py` registers two durable push consumers against NATS
**Disabled** when `NATS_URL` is unset (default in local dev without NATS). No import of `nats-py` occurs in that case.
## Observability
Logs are structured JSON via **structlog**. Every line includes `level`, `logger`, `timestamp`, and — when a W3C `traceparent` header is present on the incoming request — `trace_id` bound via Python `contextvars`, so all log lines within a request carry the same trace ID as the upstream API call.
Sentry error capture is active when `SENTRY_DSN` is set.
## Config
| Env var | Default | Description |
@@ -58,6 +64,8 @@ On startup, `nats_consumer.py` registers two durable push consumers against NATS
| `NATS_DURABLE_PREFIX` | `feature-pipeline` | Prefix for durable consumer names |
| `NATS_MAX_DELIVER` | `5` | Max redelivery attempts before dropping |
| `DEFAULT_PROMPT_VERSION` | `v1` | Fallback prompt version for `/generate` |
| `ENV` | `development` | Environment label (passed to Sentry) |
| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
## Health story

services/api/README.md Normal file

@@ -0,0 +1,89 @@
# services/api
Express BFF that serves all client-facing routes, manages sessions, runs background signal sync, and proxies admin calls to `ml/serving`.
## Contract
```
GET    /health                             → { ok: true }
POST   /api/auth/login                     → redirect to Google OAuth
GET    /api/auth/callback                  OAuth return URL
POST   /api/auth/logout
GET    /api/auth/session                   → { user? }
GET    /api/integrations                   list connected integrations
POST   /api/integrations/todoist/connect   start Todoist OAuth
GET    /api/integrations/todoist/callback
DELETE /api/integrations/:provider         disconnect
POST   /api/recommend                      → { tip }
POST   /api/tip/:id/feedback               { action } → { ok }
GET    /api/user/profile
DELETE /api/user                           account deletion
POST   /api/push/subscribe
DELETE /api/push/subscribe
GET    /api/admin/stats                    DAU/WAU, feedback breakdown
GET    /api/admin/users
GET    /api/admin/events                   recent event stream (ring buffer)
GET    /api/admin/sim/runs                 offline sim run list
POST   /api/admin/sim/run                  launch offline sim
GET    /api/admin/sim/runs/:id/output      tail sim stdout
...
GET    /api/ml/*                           admin-only proxy to ml/serving
```
## Middleware stack (request order)
1. `cors` — origin limited to `WEB_BASE_URL`
2. `tracingMiddleware` — reads or generates W3C `traceparent`; sets `req.traceId` + `req.traceparent`
3. `pinoHttp` — structured JSON request/response logs with `traceId` field; `/health` suppressed
4. `express.json()` / `cookieParser`
5. `sessionMiddleware` — validates `sid` cookie, attaches `req.userId`
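
The "reads or generates" behavior of step 2 could look roughly like the sketch below. It is not the actual `tracingMiddleware` source: the `TracedRequest` type, the helper names `parseTraceparent` and `newTraceparent`, and the choice of the `01` (sampled) flag for generated headers are all assumptions made for illustration. The request type is kept framework-agnostic so the snippet runs without Express installed.

```typescript
import { randomBytes } from "node:crypto";

// W3C Trace Context layout: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

interface TraceContext {
  traceId: string;
  traceparent: string;
}

// Returns the trace context carried by a header value, or null when the value
// is missing, malformed, or uses the all-zero IDs that the spec treats as invalid.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const value = header.trim();
  const match = TRACEPARENT_RE.exec(value);
  if (!match) return null;
  const traceId = match[1];
  const parentId = match[2];
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, traceparent: value };
}

// Generates a fresh context. The "01" flag is an assumption of this sketch.
function newTraceparent(): TraceContext {
  const traceId = randomBytes(16).toString("hex");
  const parentId = randomBytes(8).toString("hex");
  return { traceId, traceparent: `00-${traceId}-${parentId}-01` };
}

// Hypothetical minimal request shape (stands in for the Express request).
interface TracedRequest {
  headers: Record<string, string | string[] | undefined>;
  traceId?: string;
  traceparent?: string;
}

// Reuses a valid incoming header, otherwise starts a new trace.
function tracingMiddleware(req: TracedRequest, _res: unknown, next: () => void): void {
  const raw = req.headers["traceparent"];
  const header = Array.isArray(raw) ? raw[0] : raw;
  const ctx = parseTraceparent(header) ?? newTraceparent();
  req.traceId = ctx.traceId;
  req.traceparent = ctx.traceparent;
  next();
}
```

Running this before the logger matters because `pinoHttp` (step 3) needs `req.traceId` to already be set when it writes the request log line.
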
## Observability
Logs are structured JSON via **pino**. Every line includes `traceId` (extracted from the incoming W3C `traceparent` header, or generated fresh). The same `traceparent` is forwarded on all outbound HTTP calls to `ml/serving` so traces correlate end-to-end.
Sentry error capture is active when `SENTRY_DSN` is set.
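
As an illustration of the forwarding described above, a small helper can merge the request's `traceparent` into the headers of an outbound call. The name `withTraceHeaders` is hypothetical and does not come from the codebase; this is a sketch of the idea, not the service's actual HTTP client.

```typescript
// Returns a copy of `headers` with the caller's traceparent added when one exists.
// The input object is never mutated, so shared default headers stay untouched.
function withTraceHeaders(
  traceparent: string | undefined,
  headers: Record<string, string> = {},
): Record<string, string> {
  return traceparent ? { ...headers, traceparent } : { ...headers };
}
```

A call to `ml/serving` would then pass something like `{ headers: withTraceHeaders(req.traceparent, { accept: "application/json" }) }` to its HTTP client, so the Python side can bind the same trace ID to its own log lines.
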
## Background tasks
- **Todoist sync scheduler** — runs every `TODOIST_SYNC_INTERVAL_MS` (default 15 min); starts 10 s after boot to avoid startup surge.
- **Retention purge** — deletes `tipScores` and `tipFeedback` rows older than 30 days; runs on boot and daily.
- **Profile TTL invalidation** — listens to `signals.task.synced` and `signals.tip.feedback` on the in-process Bus; invalidates cached user-level profile features so the next `/recommend` gets fresh values.
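
The event-driven half of the profile invalidation can be sketched as follows. This assumes the in-process Bus behaves like a Node `EventEmitter` and that each signal carries a `userId` field; neither detail is confirmed by this README, and the `ProfileCache` class is hypothetical. Time-based (TTL) expiry is left out to keep the example short.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical payload shape; the real signal type is not shown in this README.
interface SignalEvent {
  userId: string;
}

type ProfileFeatures = Record<string, number | boolean>;

class ProfileCache {
  private readonly entries = new Map<string, ProfileFeatures>();

  constructor(bus: EventEmitter) {
    // Either signal means the user's cached profile features may be stale,
    // so the entry is dropped and recomputed on the next lookup.
    for (const subject of ["signals.task.synced", "signals.tip.feedback"]) {
      bus.on(subject, (event: SignalEvent) => {
        this.entries.delete(event.userId);
      });
    }
  }

  get(userId: string): ProfileFeatures | undefined {
    return this.entries.get(userId);
  }

  set(userId: string, features: ProfileFeatures): void {
    this.entries.set(userId, features);
  }
}
```

Only the affected user's entry is removed, so a sync for one account does not force recomputation for everyone else.
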
## Config
| Env var | Default | Description |
|---------|---------|-------------|
| `PORT` | `3001` | Listen port |
| `NODE_ENV` | `development` | Environment label |
| `DATABASE_PATH` | `./data/oo.db` | SQLite file |
| `SESSION_SECRET` | required | Cookie signing secret |
| `GOOGLE_CLIENT_ID/SECRET` | required | OAuth |
| `TODOIST_CLIENT_ID/SECRET` | required | OAuth |
| `API_BASE_URL` | `http://localhost:3001` | Self-referential redirect URI |
| `WEB_BASE_URL` | `http://localhost:3000` | CORS + post-login redirect |
| `ML_SERVING_URL` | `http://localhost:8000` | ml/serving base URL |
| `NATS_URL` | `` | NATS broker; empty = in-process bus only |
| `TODOIST_SYNC_INTERVAL_MS` | `900000` | Background sync cadence |
| `TIP_PROMPT_VERSION` | `` | Prompt variant(s) for `/generate` |
| `LOG_LEVEL` | `info` | pino log level |
| `SENTRY_DSN` | `` | Sentry DSN; empty = Sentry disabled |
| `VAPID_*` | | Web push keys |
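
One possible way to read a subset of this table at startup is sketched below. The `ApiConfig` shape and the `loadConfig` function are illustrative assumptions, not the service's actual config module; the OAuth, `VAPID_*`, and URL variables are omitted for brevity. Defaults mirror the table, and empty values for `NATS_URL` and `SENTRY_DSN` are mapped to `null` to make the "disabled" case explicit.

```typescript
interface ApiConfig {
  port: number;
  nodeEnv: string;
  databasePath: string;
  sessionSecret: string;
  mlServingUrl: string;
  natsUrl: string | null; // null = in-process bus only
  todoistSyncIntervalMs: number;
  logLevel: string;
  sentryDsn: string | null; // null = Sentry disabled
}

type Env = Record<string, string | undefined>;

// Fails fast when a variable marked "required" in the table is unset or empty.
function requiredVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}

// Parses a positive integer, falling back to the table default when unset.
function intVar(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

function loadConfig(env: Env): ApiConfig {
  return {
    port: intVar(env, "PORT", 3001),
    nodeEnv: env.NODE_ENV || "development",
    databasePath: env.DATABASE_PATH || "./data/oo.db",
    sessionSecret: requiredVar(env, "SESSION_SECRET"),
    mlServingUrl: env.ML_SERVING_URL || "http://localhost:8000",
    natsUrl: env.NATS_URL || null,
    todoistSyncIntervalMs: intVar(env, "TODOIST_SYNC_INTERVAL_MS", 900000),
    logLevel: env.LOG_LEVEL || "info",
    sentryDsn: env.SENTRY_DSN || null,
  };
}
```

In a real entry point this would typically be called once as `loadConfig(process.env)`, so a missing `SESSION_SECRET` stops the process at boot rather than on the first login.
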
## Health story
`GET /health` returns `{ ok: true }`. No dependency checks — upstream deps (`ml/serving`, NATS) have their own health endpoints checked separately.
## Extraction criteria
Extract to its own host when:
- Auth session management needs a dedicated Redis/PG session store, **or**
- Background sync load (Todoist, future connectors) displaces API serving on the shared host, **or**
- Team boundary emerges between auth/BFF and recommender orchestration.


@@ -31,9 +31,9 @@ Signals carry `features: Record<string, number | boolean>` (bandit-ready) and `m
| Policy | Status | Notes |
|--------|--------|-------|
-| `random` | Shadow | Fallback when ml/serving unreachable |
-| `egreedy-v1` | **Active** | d=7, ADR-0007 |
-| `egreedy-v2` | Shadow | d=12 + profile features, ADR-0012 |
+| `random` | Fallback | Used when ml/serving is unreachable |
+| `egreedy-v1` | Shadow | d=7, ADR-0007 |
+| `egreedy-v2` | **Active** | d=12 + profile features, ADR-0012 |
Shadow → active promotion requires offline sim + online agreement (ADR-0002).