oO/docs/adr/0010-nats-bridge-and-background-sync.md
alvis 5b52c6bf40 test: cover NATS bridge + Todoist scheduler; ADR-0010
- bus.test.ts: 4 cases for the new onPublish hook contract
- nats.test.ts: stream creation idempotency + JSON publish bridge
- scheduler.test.ts: startup delay, fan-out, per-user failure isolation
- ADR-0010 documents the bridge-don't-replace decision and the
  Todoist scheduler isolation, plus open follow-ups (#98 ml/serving
  consumer, #54 protobuf migration, graceful shutdown, metrics)
- README/overview/services README reflect the bridged event substrate
- CLAUDE.md gains a "don't nats.publish() directly" rule
- .env.example documents NATS_URL + TODOIST_SYNC_INTERVAL_MS

Verified in deployment 2026-04-18: api -> nats bridge connects on
boot, signals + feedback streams created, scheduler tick logs
"todoist sync: 1 ok, 0 failed (1 users)" within 10s. Closes #21, #22.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 07:55:25 +00:00


ADR-0010: NATS bridge over the in-process bus, and Todoist background sync

Status

Accepted — 2026-04-18

Context

ADR-0005 set protobuf + JetStream as the long-term event substrate. M1 shipped an in-process EventEmitter-based bus with the right subjects (signals.*, feedback.*) so the swap would be mechanical.

Two pressures pulled this work forward:

  1. ml/serving and future feature pipelines need to consume signals across process boundaries — the in-proc emitter cannot do that.
  2. Todoist signals were only fetched on the recommend path. Cold-cache hits added latency and a single 401/429 stalled the request that triggered it.

Decision

1. Bridge, do not replace

The Bus stays the producer. A new Bus.onPublish(hook) method registers a hook that fires on every publish. When NATS_URL is set, connectNats() registers a hook that JSON-encodes the payload and publishes it to JetStream via js.publish(subject, data).

  • Streams are created on startup, and creation is idempotent: signals (signals.>, file storage with 7-day retention, 500k msgs) and feedback (feedback.>, 30-day retention, 200k msgs).
  • JetStream publish errors are caught inside the hook so an unhealthy broker cannot crash the in-process publisher or its subscribers.
  • When NATS_URL is unset, connectNats is a no-op — local dev keeps working.
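The idempotent stream setup from the first bullet could look roughly like the sketch below. It is not the code in connectNats(): StreamSpec, StreamManagerLike, and ensureStreams are hypothetical names, and the manager interface is a stand-in for whatever JetStream management client the API uses. Only the stream names, subjects, retention, and message counts come from this ADR.

```typescript
// Hypothetical shapes for illustration; a real JetStream client has its own config types.
interface StreamSpec {
  name: string;
  subjects: string[];
  maxAgeDays: number;
  maxMsgs: number;
}

// Stand-in for a JetStream management client (assumed, not the real API).
interface StreamManagerLike {
  exists(name: string): Promise<boolean>;
  create(spec: StreamSpec): Promise<void>;
}

// Values taken from the ADR text above.
const STREAMS: StreamSpec[] = [
  { name: "signals", subjects: ["signals.>"], maxAgeDays: 7, maxMsgs: 500000 },
  { name: "feedback", subjects: ["feedback.>"], maxAgeDays: 30, maxMsgs: 200000 },
];

// Creates only the streams that are missing, so calling it on every boot is safe.
// Returns the names of the streams it actually created.
async function ensureStreams(
  manager: StreamManagerLike,
  specs: StreamSpec[] = STREAMS,
): Promise<string[]> {
  const created: string[] = [];
  for (const spec of specs) {
    if (await manager.exists(spec.name)) continue;
    await manager.create(spec);
    created.push(spec.name);
  }
  return created;
}
```

Checking for existence before creating is one way to get idempotency; relying on the broker to accept a repeated, identical create request would be another.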

The bridge preserves the existing bus.subscribe() contract for in-process consumers (reward inference, ring-buffer tail for the admin event viewer) while making events durably consumable across processes.
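To make the hook contract concrete, here is a minimal sketch of a bus with an onPublish hook and a bridge that swallows publish errors. It is an illustration under stated assumptions, not the project's Bus: the class below is a simplified stand-in that matches subjects exactly (no wildcards), JetStreamLike is a hypothetical interface rather than the real client type, and bridgeToJetStream with its errors array is invented for the example.

```typescript
type Handler = (payload: unknown) => void;
type Hook = (subject: string, payload: unknown) => void;

// Hypothetical stand-in for the JetStream client; a real client may
// require the data to be encoded as bytes rather than passed as a string.
interface JetStreamLike {
  publish(subject: string, data: string): Promise<unknown>;
}

// Simplified stand-in for the in-process bus (exact subject match only).
class Bus {
  private handlers: Record<string, Handler[]> = {};
  private hooks: Hook[] = [];

  subscribe(subject: string, handler: Handler): void {
    (this.handlers[subject] = this.handlers[subject] || []).push(handler);
  }

  onPublish(hook: Hook): void {
    this.hooks.push(hook);
  }

  publish(subject: string, payload: unknown): void {
    // In-process subscribers are served first, as before the bridge existed.
    for (const handler of this.handlers[subject] || []) handler(payload);
    // Hooks run afterwards; each hook handles its own errors.
    for (const hook of this.hooks) hook(subject, payload);
  }
}

// Registers a hook that forwards every publish to the broker as JSON.
// Synchronous throws and rejected publishes are recorded, never rethrown,
// so an unhealthy broker cannot break the in-process path.
function bridgeToJetStream(bus: Bus, js: JetStreamLike, errors: unknown[] = []): void {
  bus.onPublish((subject, payload) => {
    try {
      js.publish(subject, JSON.stringify(payload)).catch((err) => {
        errors.push(err);
      });
    } catch (err) {
      errors.push(err);
    }
  });
}
```

Because the bridge is just another hook, skipping the bridgeToJetStream call leaves the bus behaving exactly as it did before, which mirrors the no-op behaviour when NATS_URL is unset.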

2. Schedule Todoist, keep on-demand as the SLA fallback

A 15-minute background scheduler (TODOIST_SYNC_INTERVAL_MS) walks every user with tokenStatus = 'active' and calls todoistSource.fetchSignals(uid), which in turn emits signals.task.synced. The per-request fetch in the recommender stays: when the cached signals are older than 30 s, it still goes to Todoist inline, so freshness on the user's first hit of the day is unchanged.
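The on-demand fallback comes down to a freshness check on the cached signals. A minimal sketch, assuming a cache entry that records when it was fetched; CachedSignals and needsInlineFetch are hypothetical names, and only the 30 s threshold comes from this ADR:

```typescript
// Hypothetical cache entry; the recommender's real cache shape is not shown here.
interface CachedSignals {
  fetchedAtMs: number;
  signals: unknown[];
}

// The 30 s threshold from the text above.
const MAX_CACHE_AGE_MS = 30 * 1000;

// True when the request should fetch from Todoist inline:
// either nothing is cached, or the cached entry is older than the threshold.
function needsInlineFetch(
  entry: CachedSignals | undefined,
  nowMs: number,
  maxAgeMs: number = MAX_CACHE_AGE_MS,
): boolean {
  if (entry === undefined) return true;
  return nowMs - entry.fetchedAtMs > maxAgeMs;
}
```

With the scheduler keeping the cache warm, most requests would take the cached path; the inline fetch remains for entries that have aged past the threshold or were never populated.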

Per-user failures are isolated with Promise.allSettled, so one expired token cannot stop the rest of the cohort. The whole tick is wrapped in an error handler, so a transient SQLite error is logged and the tick is skipped rather than crashing the API.
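The two layers of isolation can be sketched as a single tick function. This is an approximation, not the scheduler's source: runSyncTick, its parameters, and the "tick skipped" log line are assumptions made for the example. The summary line follows the format quoted in the commit message.

```typescript
interface TickResult {
  ok: number;
  failed: number;
  users: number;
}

// Hypothetical signature: the user lookup and the per-user fetch are injected
// so the sketch stays self-contained. Returns null when the tick is skipped.
async function runSyncTick(
  listActiveUserIds: () => Promise<string[]>,
  fetchSignals: (uid: string) => Promise<unknown>,
  log: (line: string) => void = () => {},
): Promise<TickResult | null> {
  try {
    const uids = await listActiveUserIds();
    // The async wrapper turns synchronous throws into rejections, so
    // allSettled sees every failure as a per-user result.
    const results = await Promise.allSettled(
      uids.map(async (uid) => fetchSignals(uid)),
    );
    const failed = results.filter((r) => r.status === "rejected").length;
    const ok = results.length - failed;
    log(`todoist sync: ${ok} ok, ${failed} failed (${uids.length} users)`);
    return { ok, failed, users: uids.length };
  } catch (err) {
    // Anything outside the per-user fan-out (for example the user lookup)
    // lands here: log it and skip this tick instead of crashing the process.
    log(`todoist sync: tick skipped: ${String(err)}`);
    return null;
  }
}
```

A timer would call runSyncTick once per interval. Since the function never rejects, a failed tick cannot surface as an unhandled rejection in the API process.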

Consequences

  • ml/serving (and any future Python consumer) can durably tail signals.task.synced, signals.tip.served, signals.tip.feedback from JetStream without coupling to the API process.
  • Local dev still runs without NATS; the bridge is opt-in via env.
  • Wire format is JSON today (envelope per ADR-0005 not enforced yet) — see Open follow-ups.

Open follow-ups

  • An ml/serving JetStream consumer for the feature pipeline (today nothing reads from JetStream; the API only writes).
  • Move the wire payload to the protobuf envelope from ADR-0005 once the schema-registry CI gate (#54) lands.
  • Graceful shutdown of the scheduler timer on SIGTERM.
  • Per-publish failure metrics exported to the admin health view.