refactor: architecture revision — modular monolith, auth-commit, event protobuf, privacy-from-day-0

- ADR-0003: modular monolith for Phase 0 with documented extraction triggers
- ADR-0004: Auth.js + OIDC-shaped boundary; dedicated provider when mobile ships
- ADR-0005: protobuf for events, OpenAPI for HTTP, schema-registry CI gate
- New architecture docs: data-model, metrics (magic proxies), privacy (Phase-0 feature)
- Prime directives updated: privacy-as-feature, modular-by-package-deployable-by-stage
- Roadmap revised: Apple OAuth deferred to M1; web push in M1; k3s intermediate; tip-kind-aware UI
- PLAN updated: Phase-0 deletion endpoint, metrics baseline, compose profiles, import-boundary lint
- License decision in README (ARR with OSS plan in Phase 5)
2026-04-13 14:36:11 +00:00
parent cf4c7a0eb4
commit 7f173f88d3
13 changed files with 449 additions and 133 deletions
--- a/docs/adr/0003-modular-monolith-phase0.md
+++ b/docs/adr/0003-modular-monolith-phase0.md
@@ -0,0 +1,31 @@
+# ADR-0003: Modular monolith for Phase 0, extract when justified
+
+## Status
+Accepted — 2026-04-13
+
+## Context
+The initial architecture called for seven independently-deployable services on day one (gateway, auth, profile, integrations, recommender, events, notifier). For a team of ~3 streams with zero users, this is premature. Each service adds CI, deploy, DB, observability, and release-coordination overhead. It also slows the walking skeleton, which is the most important thing to ship.
+
+Modularity — the thing we actually need — is a **code-boundary** property, not a **process-boundary** property. Well-bounded packages extract to services cheaply; poorly-bounded services rarely merge back.
+
+## Decision
+- **Phase 0:** one Node process bundles `services/*` as internal packages behind their HTTP contracts. `ml/serving` is a separate Python process (language boundary). Postgres + NATS complete the stack.
+- **Directory layout** under `services/` is unchanged. Each module is a self-contained package with its own README, schema migrations, and public interface.
+- **Communication** between modules goes through the same HTTP or event contracts the modules will use post-extraction. In Phase 0 these are resolved in-process via a thin dispatcher; swapping to HTTP/NATS is a transport change, not an API change.
+- **Extraction criteria** (trigger a service split when any apply):
+  1. Language boundary (already true for `ml/serving`).
+  2. Scaling hotspot: the module's load curve diverges materially from the rest.
+  3. SLA divergence: the module needs stricter availability or latency than the monolith.
+  4. Team ownership: a dedicated team takes the module and wants independent releases.
+  5. Regulatory isolation: credentials/PII need tighter blast-radius control.
+- **`events/` is special:** even inside the monolith we use an event-emitter abstraction whose production implementation is NATS JetStream. The async boundary matters for ML correctness; the process boundary doesn't.
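The "transport change, not an API change" point in the Decision can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's dispatcher: `Transport`, `InProcessTransport`, `ModuleHandler`, `timezoneFor`, and the route string `GET /profile` are hypothetical names introduced here.

```typescript
// Hypothetical sketch: every name below is illustrative, not taken from the repository.

// A module handler takes a JSON-like request body and returns a JSON-like response.
type Json = Record<string, unknown>;
type ModuleHandler = (body: Json) => Promise<Json>;

// Callers depend only on this interface, so the transport can change without an API change.
interface Transport {
  call(module: string, route: string, body: Json): Promise<Json>;
}

// Phase-0 transport: resolves the contract in-process via a lookup table.
class InProcessTransport implements Transport {
  private readonly handlers = new Map<string, ModuleHandler>();

  register(module: string, route: string, handler: ModuleHandler): void {
    this.handlers.set(`${module} ${route}`, handler);
  }

  async call(module: string, route: string, body: Json): Promise<Json> {
    const handler = this.handlers.get(`${module} ${route}`);
    if (!handler) {
      throw new Error(`no handler for ${module} ${route}`);
    }
    // Round-trip through JSON so callers cannot rely on shared object references,
    // which would stop working once the module sits behind a real HTTP boundary.
    const wireBody = JSON.parse(JSON.stringify(body)) as Json;
    const response = await handler(wireBody);
    return JSON.parse(JSON.stringify(response)) as Json;
  }
}

// Example wiring: a caller asks the profile module for a timezone.
const transport = new InProcessTransport();
transport.register("profile", "GET /profile", async (body) => ({
  userId: body.userId,
  timezone: "UTC",
}));

async function timezoneFor(t: Transport, userId: string): Promise<string> {
  const profile = await t.call("profile", "GET /profile", { userId });
  return String(profile.timezone);
}
```

An HTTP-backed implementation of `Transport` would keep the same `call` signature, so `timezoneFor` would not need to change after extraction.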
+
+## Consequences
+- Faster Phase 0: one CI pipeline, one deploy, one observability config.
+- Cheap extraction: contracts are already HTTP/event-shaped.
+- Discipline required: no cross-module DB access, no reaching into another module's internals, even though it's physically possible. Enforced by lint/import rules.
+- Deploy story: docker-compose with two application containers (Node monolith + Python serving) until extraction begins. Compose profiles let devs bring up subsets.
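The import discipline listed under Consequences could be checked with a rule along these lines. It is a sketch only: the assumption that each module exposes its public interface at `services/<name>/index.ts` is introduced here for illustration, and `moduleOf` and `isAllowedImport` are hypothetical helpers, not an existing lint plugin.

```typescript
// Hypothetical sketch of an import-boundary rule; the path convention is an assumption.

// Returns the module name for a path under services/, or null for anything else.
function moduleOf(path: string): string | null {
  const match = /^services\/([^/]+)\//.exec(path);
  return match?.[1] ?? null;
}

// An import is allowed when it stays inside one module, or when it targets
// another module's public entry point (assumed to be services/<name>/index.ts).
function isAllowedImport(importerPath: string, importedPath: string): boolean {
  const from = moduleOf(importerPath);
  const to = moduleOf(importedPath);
  if (from === null || to === null || from === to) {
    return true; // not a cross-module import
  }
  return importedPath === `services/${to}/index.ts`;
}

// Going through the public entry point is fine; reaching into internals is not.
const allowedImport = isAllowedImport(
  "services/recommender/src/pick.ts",
  "services/profile/index.ts",
);
const blockedImport = isAllowedImport(
  "services/recommender/src/pick.ts",
  "services/profile/src/db/queries.ts",
);
```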
+
+## Non-consequences
+- We are **not** monolith-forever. We fully expect `integrations/` and `recommender/` to extract once Phase 2+ traffic patterns justify it.
+- Frontend / mobile unaffected.
--- a/docs/adr/0004-auth-authjs-with-oidc-boundary.md
+++ b/docs/adr/0004-auth-authjs-with-oidc-boundary.md
@@ -0,0 +1,23 @@
+# ADR-0004: Auth.js for Phase 0, dedicated OIDC provider when mobile ships
+
+## Status
+Accepted — 2026-04-13
+
+## Context
+We need Google (and later Apple) sign-in, session management, and JWTs other services can verify. Options considered:
+- **Auth.js (NextAuth):** a library embedded in the Next.js web app. Fastest to ship. Tight coupling to the web runtime; awkward when a native mobile client also needs tokens.
+- **Ory Kratos + Hydra:** a standalone, self-hosted identity + OIDC provider. Much more powerful. Operationally heavy for a prototype.
+- **Roll our own:** not considered.
+
+Mobile apps are Phase 3+. Phase 0 needs the cheapest credible option that does not box us in.
+
+## Decision
+- **Phase 0:** use **Auth.js** inside the web app. Google provider only (Apple deferred — paid dev account + extra domain setup).
+- **Boundary:** from day one, the `auth` module exposes an **OIDC-shaped** HTTP surface (`/me`, `/logout`, JWT verification via public JWKS, `/.well-known/openid-configuration` stub). Other services verify JWTs against that surface, not against Auth.js internals. This means the day we replace the engine, only one module changes.
+- **JWT strategy:** short-lived (10 min) access JWT, rotating refresh token in an HttpOnly cookie. JWT contains `sub`, `email`, `scope`, `sid`.
+- **Trigger to migrate to Ory (or equivalent):** any of — (a) native mobile shipping, (b) a second client type that can't piggyback on Next.js sessions, (c) multi-tenant requirement.
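The JWT strategy above might translate into checks like the following. This is a sketch under assumptions: signature verification against the JWKS is deliberately left out, `iat` and `exp` are the standard registered JWT claims rather than fields named in this ADR, the space-separated `scope` encoding is assumed, and consulting a session record on each check is one possible reading of why `sid` is in the token.

```typescript
// Hypothetical sketch: helper names and the session lookup are assumptions.

const ACCESS_TOKEN_TTL_SECONDS = 10 * 60; // the 10-minute lifetime from the decision

interface AccessClaims {
  sub: string;   // user id
  email: string;
  scope: string; // space-separated scopes (an assumed encoding)
  sid: string;   // session id, used here to check revocation
  iat: number;   // issued-at, seconds since epoch
  exp: number;   // expiry, seconds since epoch
}

interface SessionRecord {
  sid: string;
  revokedAt: number | null; // seconds since epoch, or null while active
}

function issueClaims(
  sub: string,
  email: string,
  scope: string,
  sid: string,
  nowSeconds: number,
): AccessClaims {
  return { sub, email, scope, sid, iat: nowSeconds, exp: nowSeconds + ACCESS_TOKEN_TTL_SECONDS };
}

// Checks that would run after signature verification: expiry first, then revocation.
function isUsable(
  claims: AccessClaims,
  session: SessionRecord | undefined,
  nowSeconds: number,
): boolean {
  if (nowSeconds >= claims.exp) return false;
  if (!session || session.sid !== claims.sid) return false;
  return session.revokedAt === null;
}
```

A verifier that skips the session lookup would rely on the 10-minute expiry alone to bound how long a revoked session stays usable.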
+
+## Consequences
+- Ships in days, not weeks.
+- The OIDC-shaped boundary means the migration is scoped, not scary.
+- Slight duplication early: we maintain OIDC-surface code that Auth.js mostly handles internally. Worth it.
--- a/docs/adr/0005-event-schemas-protobuf.md
+++ b/docs/adr/0005-event-schemas-protobuf.md
@@ -0,0 +1,28 @@
+# ADR-0005: Protocol Buffers for event schemas, OpenAPI for HTTP
+
+## Status
+Accepted — 2026-04-13
+
+## Context
+Two contract surfaces exist:
+1. **HTTP** — synchronous, client ↔ server, human-readable debugging matters. OpenAPI is the default and generates decent TS clients.
+2. **Events** — durable, fan-out to ML consumers, schema evolution critical. Feature pipelines trained on old schemas will silently misbehave when producers change a field.
+
+Using OpenAPI for both means:
+- Python pydantic generation is awkward and hand-maintained in practice.
+- No wire-format discipline (JSON is loose).
+- No central schema registry, so schema drift is undetected until a model regresses.
+
+## Decision
+- **HTTP** contracts: OpenAPI 3.1 in `packages/shared-types/http/`. Generate TS clients; hand-write Python pydantic models for ML consumers (few, and they're shallow).
+- **Event** contracts: Protocol Buffers in `packages/shared-types/events/`. Generate TS and Python. All events carry an envelope: `{event_id, occurred_at, schema_version, producer, payload}`.
+- **Schema registry:** lightweight self-hosted (buf.build Schema Registry OSS or a tiny registry in `events/`). CI check blocks breaking changes without a version bump.
+- **Evolution rules:** additive only within a major version; `reserved` for removed fields; new `schema_version` for breaking changes; consumers advertise the versions they accept.
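The evolution rules suggest a CI check shaped roughly like the one below. It is a simplified sketch, not buf's breaking-change detection: `MessageSchema`, `FieldDef`, and `breakingChanges` are hypothetical, and it assumes that removing a field within a version is acceptable as long as its number is reserved. A stricter gate could require a version bump for any removal.

```typescript
// Hypothetical sketch of the gate's core check; this is not buf's implementation.

interface FieldDef {
  name: string;
  number: number;
  type: string;
}

interface MessageSchema {
  schemaVersion: number; // major version of the event schema
  fields: FieldDef[];
  reserved: number[];    // field numbers retired with `reserved`
}

// Returns human-readable violations; an empty list means the change passes the gate.
function breakingChanges(prev: MessageSchema, next: MessageSchema): string[] {
  if (next.schemaVersion > prev.schemaVersion) {
    return []; // a version bump is the sanctioned way to break compatibility
  }
  const problems: string[] = [];
  for (const old of prev.fields) {
    const current = next.fields.find((f) => f.number === old.number);
    if (!current) {
      if (!next.reserved.includes(old.number)) {
        problems.push(`field ${old.number} (${old.name}) removed without reserving its number`);
      }
    } else if (current.type !== old.type) {
      problems.push(`field ${old.number} (${old.name}) changed type ${old.type} -> ${current.type}`);
    }
  }
  for (const added of next.fields) {
    if (prev.reserved.includes(added.number)) {
      problems.push(`field ${added.number} (${added.name}) reuses a reserved number`);
    }
  }
  return problems;
}
```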
+
+## Consequences
+- One extra build step in `shared-types` (buf or protoc).
+- Breaking event changes cost something — good; they should.
+- ML pipelines can replay old events against new code with confidence.
+
+## Non-consequences
+- No gRPC. HTTP stays HTTP/JSON. Protobuf is only the wire format on the event bus.
--- a/docs/architecture/data-model.md
+++ b/docs/architecture/data-model.md
@@ -0,0 +1,87 @@
+# Data model
+
+Durable entities across modules. Per-module databases/schemas own these; cross-module access is only via the module's API.
+
+## Core entities
+
+```
+User                 auth + profile
+  id (uuid)
+  created_at
+  email                        (from IdP)
+  preferred_name?
+  deleted_at?                  soft-delete for 30-day recovery; hard-delete after
+
+IdentityLink         auth
+  user_id
+  provider                     "google" | "apple"
+  provider_sub                 subject from IdP
+  created_at
+
+Session              auth
+  user_id
+  sid (uuid)                   in JWT
+  issued_at
+  expires_at
+  revoked_at?
+
+Profile              profile
+  user_id (pk)
+  timezone
+  quiet_hours                  jsonb: [{start,end,days}]
+  contexts                     jsonb: [{name,predicate}]      introduced in Phase 2
+  consents                     jsonb: {integration: {read,write,retain_days}}
+
+Credential           integrations
+  user_id
+  provider                     "todoist" | "google_calendar" | ...
+  ciphertext                   sealed-box over {access, refresh, scopes, expires_at}
+  meta                         provider-specific (sync_token cursor for Todoist)
+  created_at
+  last_refreshed_at
+  revoked_at?
+
+Event                events
+  event_id (ulid)
+  user_id
+  schema_version
+  kind                         e.g. "signals.task.updated"
+  occurred_at
+  ingested_at
+  payload                      protobuf bytes
+
+TipInstance          recommender
+  tip_id (ulid)
+  user_id
+  policy_name                  "random" | "bandit.linucb" | "remote:v3"
+  policy_version
+  candidate_source             "todoist" | "advice.library" | ...
+  context_snapshot             jsonb: features seen at decision time
+  tip                          jsonb: {kind,title,body,source,deep_link,meta}
+  created_at
+  shown_at?                    set when the client reports render
+  reaction?                    "done" | "snooze" | "dismiss" | null
+  reacted_at?
+  delivery_id?                 fk if surfaced via notifier push
+
+Delivery             notifier
+  delivery_id
+  user_id
+  tip_id
+  channel                      "webpush" | "apns" | "fcm" | "email"
+  dispatched_at
+  delivered_at?
+  failure_reason?
+```
+
+## Foreign-key discipline
+
+There are no cross-module FKs. Each module owns its tables. References by id are soft; consistency is maintained by events (user-deleted → every module cascades its own cleanup).
+
+## Deletion
+
+`User.deleted_at` set → a `user.deletion_requested` event goes out → each module soft-deletes its rows → after 30 days a scheduled job hard-deletes. Credentials are **revoked at the provider** (not just erased locally) on soft-delete. See `privacy.md`.
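The soft-delete then hard-delete sequence could look roughly like this in code. The sketch uses an in-memory stand-in for the event bus and a generic row shape; `DeletionBus`, `ModuleStore`, and `OwnedRow` are hypothetical, and provider-side credential revocation is not modelled.

```typescript
// Hypothetical sketch: the in-memory bus and row shape are assumptions for illustration.

const RECOVERY_WINDOW_MS = 30 * 24 * 60 * 60 * 1000; // 30-day recovery window

interface OwnedRow {
  userId: string;
  deletedAt: number | null; // epoch milliseconds, null while live
}

type DeletionHandler = (userId: string, at: number) => void;

// Stand-in for the event bus: modules subscribe to user.deletion_requested.
class DeletionBus {
  private readonly handlers: DeletionHandler[] = [];

  subscribe(handler: DeletionHandler): void {
    this.handlers.push(handler);
  }

  emitDeletionRequested(userId: string, at: number): void {
    for (const handler of this.handlers) handler(userId, at);
  }
}

// Each module owns its rows and reacts to the event; nothing reaches across modules.
class ModuleStore {
  rows: OwnedRow[] = [];

  attach(bus: DeletionBus): void {
    bus.subscribe((userId, at) => {
      for (const row of this.rows) {
        if (row.userId === userId && row.deletedAt === null) row.deletedAt = at;
      }
    });
  }

  // The scheduled job: drop rows whose soft-delete has reached the recovery window.
  // Returns how many rows were hard-deleted.
  hardDeleteExpired(now: number): number {
    const before = this.rows.length;
    this.rows = this.rows.filter(
      (row) => row.deletedAt === null || now - row.deletedAt < RECOVERY_WINDOW_MS,
    );
    return before - this.rows.length;
  }
}
```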
+
+## Replay and reproducibility
+
+`TipInstance.context_snapshot` captures the exact features that produced the decision. This is what lets offline replay re-score historical tips against a new policy without touching the feature store.
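Offline replay from snapshots might be sketched as follows. The `ReplayPolicy` interface, the numeric feature map, and the feature names `hour_of_day` and `overdue_tasks` are assumptions made for the example; the point is only that re-scoring reads the stored snapshot and nothing else.

```typescript
// Hypothetical sketch: the policy interface and feature names are assumptions.

type Reaction = "done" | "snooze" | "dismiss" | null;

interface LoggedTip {
  tipId: string;
  contextSnapshot: Record<string, number>; // features seen at decision time
  reaction: Reaction;
}

// A candidate policy scores a tip from the stored snapshot alone.
interface ReplayPolicy {
  name: string;
  score(context: Record<string, number>): number;
}

interface ReplayRow {
  tipId: string;
  score: number;
  reaction: Reaction;
}

// Re-score history without any feature-store lookup: the snapshot is the only input.
function replay(history: LoggedTip[], policy: ReplayPolicy): ReplayRow[] {
  return history.map((tip) => ({
    tipId: tip.tipId,
    score: policy.score(tip.contextSnapshot),
    reaction: tip.reaction,
  }));
}

// Example policy: a fixed linear score over two made-up features.
const linearDemo: ReplayPolicy = {
  name: "linear.demo",
  score: (ctx) => 0.5 * (ctx["hour_of_day"] ?? 0) + 2 * (ctx["overdue_tasks"] ?? 0),
};
```

Pairing each new score with the logged reaction is what lets a new policy be compared against what users actually did.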
--- a/docs/architecture/metrics.md
+++ b/docs/architecture/metrics.md
@@ -0,0 +1,43 @@
+# Metrics: measuring "magic"
+
+We cannot build a product whose core promise is "feels like magic" without proxies for it. These are the metrics every change is measured against.
+
+## North star
+
+**Week-2 tip-reaction rate** — of users who saw a tip in week 1, what fraction reacted to *any* tip in week 2? Captures "did this become part of your life."
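One way to compute the north-star number from activity rows is sketched below. The `TipActivity` shape and the 1-based week index relative to sign-up are assumptions; the real inputs would come from `TipInstance.shown_at` and `reacted_at`.

```typescript
// Hypothetical sketch: the input shape and week numbering are assumptions.

interface TipActivity {
  userId: string;
  week: number;     // 1-based week index since the user's sign-up
  shown: boolean;   // the client reported a render
  reacted: boolean; // done, snooze, or dismiss
}

// Of users who saw a tip in week 1, the fraction who reacted to any tip in week 2.
// Returns null when nobody saw a tip in week 1, since the rate is undefined then.
function week2ReactionRate(activity: TipActivity[]): number | null {
  const sawInWeek1 = new Set<string>();
  const reactedInWeek2 = new Set<string>();
  for (const a of activity) {
    if (a.week === 1 && a.shown) sawInWeek1.add(a.userId);
    if (a.week === 2 && a.reacted) reactedInWeek2.add(a.userId);
  }
  if (sawInWeek1.size === 0) return null;
  let retained = 0;
  sawInWeek1.forEach((userId) => {
    if (reactedInWeek2.has(userId)) retained += 1;
  });
  return retained / sawInWeek1.size;
}
```

Users who first react in week 2 without having seen a tip in week 1 are excluded from both numerator and denominator.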
+
+## Activation (single-session)
+
+- **Time-to-first-tip** — sign-in → tip rendered. Target: ≤ 60 s on the happy path.
+- **First-tip reaction rate** — fraction of users who interact (done/snooze/dismiss) with their very first tip. Target: > 50%.
+
+## Engagement
+
+- **Dwell-before-action** — seconds between tip render and first reaction. Too short = glance-away; too long = confused.
+- **Done share: Done / (Done + Snooze + Dismiss)** — the quality proxy. Rising = tips feel on-target.
+- **Snooze:Dismiss ratio** — high snooze = "good tip, wrong moment" (timing problem). High dismiss = "wrong tip entirely" (relevance problem). These point at different fixes.
+- **Return cadence** — median inter-session gap. Stable-and-short > spiky.
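The two ratio metrics in this list reduce to simple arithmetic over reaction counts. In the sketch below, the `diagnose` helper and its default threshold of 1 are illustrative assumptions, not values this doc specifies.

```typescript
// Hypothetical sketch: the threshold and the diagnosis labels are assumptions.

interface ReactionCounts {
  done: number;
  snooze: number;
  dismiss: number;
}

// Done / (Done + Snooze + Dismiss); null when there were no reactions at all.
function doneShare(c: ReactionCounts): number | null {
  const total = c.done + c.snooze + c.dismiss;
  return total === 0 ? null : c.done / total;
}

// Snooze:Dismiss ratio; null when there are no dismissals to divide by.
function snoozeToDismiss(c: ReactionCounts): number | null {
  return c.dismiss === 0 ? null : c.snooze / c.dismiss;
}

type Diagnosis = "timing" | "relevance" | "inconclusive";

// A ratio above the threshold points at timing, below it at relevance.
function diagnose(c: ReactionCounts, threshold = 1): Diagnosis {
  const ratio = snoozeToDismiss(c);
  if (ratio === null || ratio === threshold) return "inconclusive";
  return ratio > threshold ? "timing" : "relevance";
}
```

For example, 6 done, 3 snoozed, and 1 dismissed gives a done share of 0.6 and a snooze:dismiss ratio of 3, which this sketch labels a timing problem.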
+
+## Retention
+
+- D1, D7, D28 retention. Cohort-sliced by connected integrations.
+- Churn signal: 7 days without a session.
+
+## ML health (from M1)
+
+- Policy latency p50/p95/p99 at the recommender boundary.
+- Feature null-rate per feature, per user.
+- Online/offline reward disagreement for shadowed policies.
+- Bandit regret proxy: observed reward vs an oracle's best-possible on the same candidates.
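For the policy-latency item, p50/p95/p99 could be derived from raw samples as sketched here. The nearest-rank definition is one common choice and is an assumption; a metrics backend may interpolate instead. `percentile` and `latencySummary` are hypothetical helpers.

```typescript
// Hypothetical sketch using the nearest-rank percentile definition.

// Nearest rank: the smallest sample such that at least p% of samples are <= it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  // Multiply before dividing so the rank is computed on integers where possible.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

interface LatencySummary {
  p50: number;
  p95: number;
  p99: number;
}

function latencySummary(samplesMs: number[]): LatencySummary {
  return {
    p50: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
    p99: percentile(samplesMs, 99),
  };
}
```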
+
+## Privacy & trust
+
+- Account-deletion completion time (target: < 24 h).
+- Provider-revocation success rate on disconnect.
+- Number of active credentials per user (low = healthy).
+
+## How metrics become decisions
+
+- **Per-change.** Any policy or UX change declares which metric it expects to move and by how much. Missing the target triggers a review, not an automatic rollback (humans judge).
+- **Shadow → A/B → launch.** Policy changes ship in shadow first (log what the policy *would* have recommended); then A/B on live traffic; then launch once the online reward estimate ≥ incumbent by a confidence-interval margin.
+- **Dashboards before features.** If we cannot measure a feature's impact on the north-star metric, we defer the feature.
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -3,22 +3,25 @@
 ## Guiding constraints

 - The **recommendation decision** is the hot path. Every architectural choice should shorten the distance between a new signal and a better tip.
-- Services are small and independently deployable, but we do **not** multiply services for its own sake. Split by team-of-ownership and by data lifecycle.
-- Python for ML, TypeScript for applications, shared contracts regenerated from a single source of truth.
+- Modularity lives in **code boundaries**. Deploy topology follows pressure, not anticipation (ADR-0003).
+- Python for ML, TypeScript for applications. Shared contracts regenerated from a single source of truth: OpenAPI for HTTP, protobuf for events (ADR-0005).
+- Privacy is a Phase-0 feature, not a Phase-5 compliance project (see `privacy.md`).

-## Services
+## Modules

-| Service | Language | Responsibility | Owns data |
-|---|---|---|---|
-| `gateway` | TS (Node) | BFF for web/mobile; auth-checking; request fan-out | — |
-| `auth` | TS | OAuth (Google, Apple), sessions, token issuance | identities, sessions |
-| `profile` | TS | user profile, preferences, consents | profiles |
-| `integrations` | TS | third-party connectors, token vault, signal fetch | credentials, cursors |
-| `events` | TS | event-bus ingress, normalization, durable log | signal store |
-| `recommender` | TS | orchestration: candidates → policy → tip; feedback sink | tip history |
-| `ml/serving` | Python | online scoring for policies/models | — (stateless) |
-| `ml/pipelines` | Python | batch feature + training pipelines | feature store, models |
-| `notifier` | TS | push/email delivery, quiet hours, dedupe | delivery log |
+| Module | Language | Responsibility | Owns data | Phase-0 process |
+|---|---|---|---|---|
+| `gateway` | TS | BFF for web/mobile; auth-check; fan-out | — | Node monolith |
+| `auth` | TS | OAuth (Google; Apple in M1), sessions, JWT | identities, sessions | Node monolith |
+| `profile` | TS | user profile, preferences, consents | profiles | Node monolith |
+| `integrations` | TS | third-party connectors, token vault, signal fetch | credentials, cursors | Node monolith |
+| `events` | TS | event-bus abstraction + durable log (M1) | signal store | Node monolith (in-proc emitter) |
+| `recommender` | TS | orchestration: candidates → policy → tip; feedback sink | tip history | Node monolith |
+| `notifier` | TS | push/email delivery, quiet hours, dedupe | delivery log | Node monolith (web push in M1) |
+| `ml/serving` | Python | online scoring for policies/models | — (stateless) | **separate process** |
+| `ml/pipelines` | Python | batch feature + training pipelines | feature store, models | separate (from M4) |
+
+Extraction from the monolith is triggered by language boundary, scaling hotspot, SLA divergence, team ownership, or regulatory isolation (ADR-0003). `ml/serving` is pre-extracted on language grounds.

 ## Data boundaries

@@ -36,9 +39,28 @@ User reactions (done / snooze / dismiss) are events too. They close the loop as

 ## Why these choices

-- **NATS JetStream** over Kafka for Phase 1: lighter, single-binary, fits the "one VM" deployment. Swap to Kafka in Phase 4.
-- **Postgres** everywhere for OLTP. Per-service schemas, not per-service instances in dev.
+- **Modular monolith + Python ML** in Phase 0 to ship the walking skeleton fast without foreclosing decomposition (ADR-0003).
+- **NATS JetStream** over Kafka for Phase 1: lighter, single-binary, fits the "one VM" deployment. Swap to Kafka in Phase 4 if fan-out justifies it.
+- **Postgres** for OLTP; per-module schemas in dev; separate databases once modules extract.
 - **FastAPI + Pydantic** for ML serving — fast, typed, swappable runtime (ONNX, Triton) behind it.
+- **Protobuf** for event schemas with a schema registry (ADR-0005) — train/serve parity depends on this.
+- **OpenAPI** for HTTP; TS client auto-generated; Python pydantic hand-written while consumers are few.
 - **Feast** for feature store when we get there; homegrown adapter until then (Phase 1 seam).
 - **MLflow** for model registry; artifacts in MinIO/S3.
-- **Auth.js or Ory** for identity — we will not write crypto.
+- **Auth.js** embedded behind an OIDC-shaped boundary (ADR-0004). Swap to a standalone OIDC provider when mobile ships.
+- **k3s** as the first step beyond docker-compose — no "compose → full k8s" cliff.
+
+## Decision flow for a new tip
+
+```
+client ─► gateway ─► recommender
+                       │
+                       ├─► candidates:   integrations.fetchCandidates(user)  + advice.library
+                       ├─► context:      FeatureAssembler(user, request)
+                       ├─► policy:       PolicyRegistry.get(policyName).pick(candidates, context)
+                       ├─► shadows:      run shadow policies in parallel, log their picks
+                       └─► persist:      TipInstance{context_snapshot, policy, tip}
+                       ◄─  tip
+```
+
+Feedback travels back the same path: `POST /feedback → events.emit(feedback.reaction)` → pipelines consume → bandit/model updated on next retrain.
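The decision flow above could be approximated in code as follows. All types and names (`Candidate`, `TipPolicy`, `decide`, the two demo policies) are hypothetical, candidate fetching and feature assembly are assumed to have happened already, and shadows run sequentially here rather than in parallel.

```typescript
// Hypothetical sketch of the flow; every type and name here is illustrative.

interface Candidate {
  id: string;
  source: string;
  priority: number;
}

type DecisionContext = Record<string, number>;

interface TipPolicy {
  name: string;
  pick(candidates: Candidate[], context: DecisionContext): Candidate;
}

interface Decision {
  tipInstance: { policyName: string; contextSnapshot: DecisionContext; tip: Candidate };
  shadowLog: { policyName: string; wouldHavePicked: string }[];
}

function decide(
  candidates: Candidate[],
  context: DecisionContext,
  live: TipPolicy,
  shadows: TipPolicy[],
): Decision {
  if (candidates.length === 0) throw new Error("no candidates to choose from");
  const tip = live.pick(candidates, context);
  // Shadow picks are logged for later comparison and never served.
  const shadowLog = shadows.map((shadow) => ({
    policyName: shadow.name,
    wouldHavePicked: shadow.pick(candidates, context).id,
  }));
  // Copy the context so later mutation cannot alter what was recorded.
  return {
    tipInstance: { policyName: live.name, contextSnapshot: { ...context }, tip },
    shadowLog,
  };
}

// Two toy policies used only to exercise the flow.
const highestPriority: TipPolicy = {
  name: "highest-priority",
  pick: (cs) => cs.reduce((best, c) => (c.priority > best.priority ? c : best)),
};
const lowestPriority: TipPolicy = {
  name: "lowest-priority",
  pick: (cs) => cs.reduce((best, c) => (c.priority < best.priority ? c : best)),
};
```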
--- a/docs/architecture/privacy.md
+++ b/docs/architecture/privacy.md
@@ -0,0 +1,40 @@
+# Privacy architecture
+
+Privacy is a Phase 0 feature, not a Phase 5 compliance project. This doc is the minimum.
+
+## Principles
+
+1. **Data minimization.** Store only what we need for the tip. Raw task titles stay at Todoist; we store references + computed features. If a feature doesn't lift a metric, its input data doesn't get stored.
+2. **User-visible controls.** Every connection shows exactly which scopes we hold and what we've computed. One tap disconnects and revokes.
+3. **Deletion is real.** Deleting an account revokes provider tokens, purges credentials immediately, and soft-deletes user data for a 30-day recovery window, then hard-deletes.
+4. **No surprise sharing.** Cross-user / collaborative features are opt-in, per category, per integration.
+5. **Encryption in transit and at rest.** TLS everywhere; column-level encryption for credentials; disk-level for backups.
+
+## Flows
+
+### Connect
+User taps "Connect Todoist" → consent screen lists: scopes requested, what we store, what we compute, retention, revocation instructions → OAuth → stored credential is immediately testable and shows in `/connect`.
+
+### Disconnect
+User taps disconnect → `Credential.revoked_at` set → provider-side revocation attempted (Todoist: token revocation endpoint) → credential erased on success → `credential.revoked` event → downstream modules drop associated cursors, caches, derived features for that `(user, provider)` pair.
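The disconnect sequence might be modelled as below. This is a sketch under assumptions: `revokeAtProvider` is a hypothetical stand-in for the provider call and is synchronous only for brevity, and the failure branch (keep the ciphertext for a retry, emit no event yet) is an assumed policy that the flow above does not spell out.

```typescript
// Hypothetical sketch: the provider call and the failure handling are assumptions.

interface StoredCredential {
  userId: string;
  provider: string;
  ciphertext: string | null; // null once erased
  revokedAt: number | null;  // epoch milliseconds
}

interface DisconnectResult {
  credential: StoredCredential;
  events: string[];          // event kinds emitted by this step
  providerRevoked: boolean;
}

function disconnect(
  credential: StoredCredential,
  revokeAtProvider: (c: StoredCredential) => boolean,
  now: number,
): DisconnectResult {
  // 1. Mark the credential revoked locally before anything else.
  const revoked: StoredCredential = { ...credential, revokedAt: now };
  // 2. Attempt provider-side revocation.
  const providerRevoked = revokeAtProvider(revoked);
  if (!providerRevoked) {
    // Assumed policy: keep the ciphertext so revocation can be retried later.
    return { credential: revoked, events: [], providerRevoked };
  }
  // 3. Erase on success, then 4. announce it so downstream modules can clean up.
  return {
    credential: { ...revoked, ciphertext: null },
    events: ["credential.revoked"],
    providerRevoked,
  };
}
```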
+
+### Delete account
+User taps "Delete account" in settings → hard confirm → `User.deleted_at` set, all sessions revoked, `user.deletion_requested` event fanned out → every module processes its portion (credentials revoked + purged; profile scrubbed; tip history anonymized to aggregate stats only or purged, per retention policy; events purged on schedule) → within 24 hours account is non-recoverable operationally; within 30 days all rows are hard-deleted.
+
+### Export (Phase 2)
+`GET /me/export` returns a JSON bundle of everything we hold for the user: profile, consents, credentials-metadata (not secrets), events, tip history.
+
+## Scope boundaries
+
+Each integration declares the scopes it requests and the features it derives. The `Profile.consents` column is the source of truth; a scope removed from consent short-circuits derived-feature computation at the feature store.
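The short-circuit described above could be implemented along these lines. The sketch keys consent by integration, mirroring the `Profile.consents` shape, and uses the `read` flag as the gate; `FeatureSpec` and `computeFeatures` are hypothetical, and finer-grained per-scope checks are not modelled.

```typescript
// Hypothetical sketch: gating on the `read` flag is an assumed simplification.

interface Consent {
  read: boolean;
  write: boolean;
  retainDays: number;
}

// Keyed by integration name, mirroring the Profile.consents column.
type Consents = Record<string, Consent>;

interface FeatureSpec {
  name: string;
  integration: string;   // the integration whose data this feature derives from
  compute: () => number; // only invoked when consent allows it
}

function computeFeatures(specs: FeatureSpec[], consents: Consents): Record<string, number> {
  const features: Record<string, number> = {};
  for (const spec of specs) {
    const consent = consents[spec.integration];
    // Short-circuit: without read consent the compute function is never called.
    if (!consent || !consent.read) continue;
    features[spec.name] = spec.compute();
  }
  return features;
}
```

Skipping the call itself, rather than discarding its result, means no provider data is touched once consent is withdrawn.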
+
+## Audit
+
+- Privileged actions (admin-initiated deletions, credential decryption outside the normal refresh path) go to an append-only audit log from Phase 0.
+- Per-user access log available via `GET /me/access-log` (Phase 2).
+
+## Legal surface (Phase 0 minimum)
+
+- Terms of Service + Privacy Policy documents shipped alongside the sign-in page.
+- Consent capture on first sign-in, with a versioned ToS/PP hash stored per user.
+- Data-subject request inbox (email) wired up before onboarding the first external user.