refactor: architecture revision — modular monolith, auth-commit, event protobuf, privacy-from-day-0

- ADR-0003: modular monolith for Phase 0 with documented extraction triggers
- ADR-0004: Auth.js + OIDC-shaped boundary; dedicated provider when mobile ships
- ADR-0005: protobuf for events, OpenAPI for HTTP, schema-registry CI gate
- New architecture docs: data-model, metrics (magic proxies), privacy (Phase-0 feature)
- Prime directives updated: privacy-as-feature, modular-by-package-deployable-by-stage
- Roadmap revised: Apple OAuth deferred to M1; web push in M1; k3s intermediate; tip-kind-aware UI
- PLAN updated: Phase-0 deletion endpoint, metrics baseline, compose profiles, import-boundary lint
- License decision in README (ARR with OSS plan in Phase 5)
2026-04-13 14:36:11 +00:00
parent cf4c7a0eb4
commit 7f173f88d3
13 changed files with 449 additions and 133 deletions
--- a/docs/adr/0003-modular-monolith-phase0.md
+++ b/docs/adr/0003-modular-monolith-phase0.md
@@ -0,0 +1,31 @@
+# ADR-0003: Modular monolith for Phase 0, extract when justified
+
+## Status
+Accepted — 2026-04-13
+
+## Context
+The initial architecture called for seven independently-deployable services on day one (gateway, auth, profile, integrations, recommender, events, notifier). For a team running ~3 workstreams with zero users, this is premature. Each service adds CI, deploy, DB, observability, and release-coordination overhead. It also slows the walking skeleton, which is the most important thing to ship.
+
+Modularity — the thing we actually need — is a **code-boundary** property, not a **process-boundary** property. Well-bounded packages extract to services cheaply; poorly-bounded services rarely merge back.
+
+## Decision
+- **Phase 0:** one Node process bundles `services/*` as internal packages behind their HTTP contracts. `ml/serving` is a separate Python process (language boundary). Postgres + NATS complete the stack.
+- **Directory layout** under `services/` is unchanged. Each module is a self-contained package with its own README, schema migrations, and public interface.
+- **Communication** between modules goes through the same HTTP or event contracts the modules will use post-extraction. In Phase 0 these calls are resolved in-process via a thin dispatcher; swapping to HTTP/NATS is a transport change, not an API change.
+- **Extraction criteria** (trigger a service split when any apply):
+  1. Language boundary (already true for `ml/serving`).
+  2. Scaling hotspot: the module's load curve diverges materially from the rest.
+  3. SLA divergence: the module needs stricter availability or latency than the monolith.
+  4. Team ownership: a dedicated team takes the module and wants independent releases.
+  5. Regulatory isolation: credentials/PII need tighter blast-radius control.
+- **`events/` is special:** even inside the monolith we use an event-emitter abstraction whose production implementation is NATS JetStream. The async boundary matters for ML correctness; the process boundary doesn't.
+
+## Consequences
+- Faster Phase 0: one CI pipeline, one deploy, one observability config.
+- Cheap extraction: contracts are already HTTP/event-shaped.
+- Discipline required: no cross-module DB access, no reaching into another module's internals, even though it's physically possible. Enforced by lint/import rules.
+- Deploy story: docker-compose with two application containers (Node monolith + Python serving) until extraction begins. Compose profiles let devs bring up subsets.
+
+## Non-consequences
+- We are **not** monolith-forever. We fully expect `integrations/` and `recommender/` to extract once Phase 2+ traffic patterns justify it.
+- Frontend / mobile unaffected.
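
A minimal sketch of the "thin dispatcher" idea from ADR-0003's Decision section, under stated assumptions. The names `Transport`, `InProcessTransport`, `Handler`, and the `profile` route are hypothetical and do not come from the commit. The point it illustrates is that callers depend only on a transport interface, so replacing the in-process implementation with an HTTP-backed one would not touch call sites.

```typescript
// Hypothetical names throughout; this is a sketch, not the project's dispatcher.
type Handler = (body: unknown) => Promise<unknown>;

// Callers depend on this interface only. An HTTP-backed implementation
// could replace InProcessTransport after a module is extracted.
interface Transport {
  call(module: string, route: string, body: unknown): Promise<unknown>;
}

// Simulate a wire hop: callers must not rely on shared object references,
// because those disappear once the call goes over HTTP.
function viaWire(value: unknown): unknown {
  return JSON.parse(JSON.stringify(value ?? null));
}

class InProcessTransport implements Transport {
  private handlers = new Map<string, Handler>();

  // Each module registers the handlers that back its public contract.
  register(module: string, route: string, handler: Handler): void {
    this.handlers.set(`${module}:${route}`, handler);
  }

  async call(module: string, route: string, body: unknown): Promise<unknown> {
    const handler = this.handlers.get(`${module}:${route}`);
    if (!handler) {
      throw new Error(`no handler registered for ${module}:${route}`);
    }
    return viaWire(await handler(viaWire(body)));
  }
}

// Usage: the caller names a module and a route, never an internal function.
const transport = new InProcessTransport();
transport.register("profile", "GET /profile", async (body) => ({ received: body }));
```

The JSON round-trip is a deliberate design choice in this sketch: it costs a little in-process, but it surfaces non-serializable arguments before extraction rather than after.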
--- a/docs/adr/0004-auth-authjs-with-oidc-boundary.md
+++ b/docs/adr/0004-auth-authjs-with-oidc-boundary.md
@@ -0,0 +1,23 @@
+# ADR-0004: Auth.js for Phase 0, dedicated OIDC provider when mobile ships
+
+## Status
+Accepted — 2026-04-13
+
+## Context
+We need Google (and later Apple) sign-in, session management, and JWTs other services can verify. Options considered:
+- **Auth.js (NextAuth):** a library embedded in the Next.js web app. Fastest to ship. Tight coupling to the web runtime; awkward when a native mobile client also needs tokens.
+- **Ory Kratos + Hydra:** a standalone, self-hosted identity + OIDC provider. Much more powerful. Operationally heavy for a prototype.
+- **Roll our own:** not considered.
+
+Mobile apps are Phase 3+. Phase 0 needs the cheapest credible option that does not box us in.
+
+## Decision
+- **Phase 0:** use **Auth.js** inside the web app. Google provider only (Apple deferred — paid dev account + extra domain setup).
+- **Boundary:** from day one, the `auth` module exposes an **OIDC-shaped** HTTP surface (`/me`, `/logout`, JWT verification via public JWKS, `/.well-known/openid-configuration` stub). Other services verify JWTs against that surface, not against Auth.js internals. This means the day we replace the engine, only one module changes.
+- **JWT strategy:** short-lived (10 min) access JWT, rotating refresh token in an HttpOnly cookie. JWT contains `sub`, `email`, `scope`, `sid`.
+- **Trigger to migrate to Ory (or equivalent):** any of — (a) native mobile shipping, (b) a second client type that can't piggyback on Next.js sessions, (c) multi-tenant requirement.
+
+## Consequences
+- Ships in days, not weeks.
+- The OIDC-shaped boundary means the migration is scoped, not scary.
+- Slight duplication early: we maintain OIDC-surface code that Auth.js mostly handles internally. Worth it.
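
A sketch of the claim checks a consuming service might run on an access token described in ADR-0004, after the signature has already been verified against the JWKS (signature verification is out of scope here). The claim names `sub`, `email`, `scope`, and `sid` and the 10-minute lifetime come from the ADR. `iat` and `exp` are the standard JWT registered claims, and their presence in these tokens is an assumption. `checkAccessClaims` and `MAX_TTL_SECONDS` are hypothetical names.

```typescript
// Claims listed in ADR-0004, plus the standard iat/exp claims (assumed present).
interface AccessClaims {
  sub: string;
  email: string;
  scope: string;
  sid: string;
  iat: number; // issued-at, seconds since epoch
  exp: number; // expiry, seconds since epoch
}

// ADR-0004 specifies a short-lived (10 min) access JWT.
const MAX_TTL_SECONDS = 10 * 60;

// Validates an already signature-verified payload. Throws on any violation.
function checkAccessClaims(
  payload: Record<string, unknown>,
  nowSeconds: number
): AccessClaims {
  for (const key of ["sub", "email", "scope", "sid"] as const) {
    const value = payload[key];
    if (typeof value !== "string" || value === "") {
      throw new Error(`missing claim: ${key}`);
    }
  }
  const iat = payload["iat"];
  const exp = payload["exp"];
  if (typeof iat !== "number" || typeof exp !== "number") {
    throw new Error("missing iat/exp");
  }
  if (exp <= nowSeconds) {
    throw new Error("token expired");
  }
  if (exp - iat > MAX_TTL_SECONDS) {
    throw new Error("token lifetime exceeds 10 minutes");
  }
  return payload as unknown as AccessClaims;
}
```

Because the check depends only on the token payload, it would stay the same if Auth.js were later replaced by a dedicated provider that issues the same claims.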
--- a/docs/adr/0005-event-schemas-protobuf.md
+++ b/docs/adr/0005-event-schemas-protobuf.md
@@ -0,0 +1,28 @@
+# ADR-0005: Protocol Buffers for event schemas, OpenAPI for HTTP
+
+## Status
+Accepted — 2026-04-13
+
+## Context
+Two contract surfaces exist:
+1. **HTTP** — synchronous, client ↔ server, human-readable debugging matters. OpenAPI is the default and generates decent TS clients.
+2. **Events** — durable, fan-out to ML consumers, schema evolution critical. Feature pipelines trained on old schemas will silently misbehave when producers change a field.
+
+Using OpenAPI for both means:
+- Python pydantic generation is awkward and hand-maintained in practice.
+- No wire-format discipline (JSON is loose).
+- No central schema registry, so schema drift is undetected until a model regresses.
+
+## Decision
+- **HTTP** contracts: OpenAPI 3.1 in `packages/shared-types/http/`. Generate TS clients; hand-write Python pydantic models for ML consumers (few, and they're shallow).
+- **Event** contracts: Protocol Buffers in `packages/shared-types/events/`. Generate TS and Python. All events carry an envelope: `{event_id, occurred_at, schema_version, producer, payload}`.
+- **Schema registry:** lightweight self-hosted (buf.build Schema Registry OSS or a tiny registry in `events/`). CI check blocks breaking changes without a version bump.
+- **Evolution rules:** additive only within a major version; `reserved` for removed fields; new `schema_version` for breaking changes; consumers advertise the versions they accept.
+
+## Consequences
+- One extra build step in `shared-types` (buf or protoc).
+- Breaking event changes cost something — good; they should.
+- ML pipelines can replay old events against new code with confidence.
+
+## Non-consequences
+- No gRPC. HTTP stays HTTP/JSON. Protobuf is only the wire format on the event bus.
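
A sketch of the event envelope and the "consumers advertise the versions they accept" rule from ADR-0005, written as plain TypeScript types rather than generated protobuf code. The envelope field names come from the ADR. The field types, `makeConsumer`, and the example payload shape are assumptions for illustration only.

```typescript
// Envelope fields are from ADR-0005; the concrete types are assumptions.
interface EventEnvelope<P> {
  event_id: string;
  occurred_at: string;    // assumed ISO 8601 timestamp
  schema_version: number; // assumed integer major version
  producer: string;
  payload: P;
}

// Builds a consumer that handles only the schema versions it advertises.
// Returns true when the event was handled and false when it was skipped.
function makeConsumer<P>(
  accepted: ReadonlySet<number>,
  handle: (payload: P) => void
): (event: EventEnvelope<P>) => boolean {
  return (event) => {
    if (!accepted.has(event.schema_version)) {
      return false; // unknown version: skip instead of misreading fields
    }
    handle(event.payload);
    return true;
  };
}

// Usage with a hypothetical payload shape.
const seenUsers: string[] = [];
const consume = makeConsumer<{ user_id: string }>(
  new Set([1, 2]),
  (payload) => { seenUsers.push(payload.user_id); }
);
```

Skipping unknown versions explicitly is one way to turn the silent misbehavior described in the Context section into a visible, countable outcome.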