chore: remove Airflow completely from the stack

Drop all four Airflow containers (db, init, webserver, scheduler) from the mlops compose profile, leaving MLflow as the sole mlops service. Remove AIRFLOW_* env vars, config fields, health-check entries, DAG trigger code in admin/bench routes, the airflow_dag_run_id schema column, Airflow nav links and DAG-run links in the admin UI, the two Airflow DAG files (bench_dag.py, sim_dag.py), and all related docs/ADR references. Simulations now run exclusively via the subprocess path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-03 16:38:46 +00:00
parent ce1c8bde57
commit f8d66aa01f
27 changed files with 663 additions and 719 deletions
--- a/README.md
+++ b/README.md
@@ -104,13 +104,15 @@ User signals  ──▶  Context assembler  ──▶  LiteLLM  ──▶  Ollam

 **Why Ollama first:**  Tips contain personal context. Local inference means no user data leaves the host for the inference path. Cloud models (Anthropic, OpenAI) are opt-in fallbacks for evaluation and simulation only, gated behind `ANTHROPIC_API_KEY`.

-### Models (planned)
+### Models (planned; routes through LiteLLM)

 | Alias | Model | Task |
 |-------|-------|------|
-| `tip-generator` | qwen2.5:7b (default) | Generate typed tip candidates from user context |
-| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup |
-| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B |
+| `tip-generator` | qwen2.5:1.5b (default) | Generate typed tip candidates from user context; local-first via Ollama |
+| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup; local via Ollama |
+| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B (requires `ANTHROPIC_API_KEY`) |
+
+All model calls route through **LiteLLM** at `llm.alogins.net` (or `LITELLM_URL` env var) using model aliases. This decouples tip generation from model selection — swap the backend model in LiteLLM config without code changes. See ADR-0008.

 ---

@@ -134,22 +136,24 @@ Goal: tips are picked, not drawn from a hat — and they arrive at the right mom
 - [x] Event bus scaffold: typed in-process EventEmitter with 500-event ring buffer; subjects match future NATS JetStream — swap is mechanical
 - [x] Todoist sync emits `signals.task.synced`; tip served/feedback emit `signals.tip.*`
 - [x] Features extracted per task: `is_overdue`, `task_age_days`, `priority`; context: `hour_of_day`, `day_of_week`
- [x] `ml/serving` LinUCB (d=5) + **ε-greedy v1** (d=7, ε=0.10, day-of-week sin/cos features); per-user state persisted to disk
+- [x] **ε-greedy v1** (d=7, ε=0.10, day-of-week sin/cos features); per-user state persisted to disk
+- [x] **ε-greedy v2** (d=12, profile features: completion rate, dismiss rate, dwell, preferred hour, tip volume) in shadow; promoted to active policy (ADR-0012)
 - [x] `RemotePolicy` in recommender: calls ml/serving, falls back to RandomPolicy on timeout/error; logs explainability to `tip_scores`
 - [x] Feedback loop: dwell-time inferred reward (`inferReward`) → online model update; `done` in 15 s–2 min = +1.0 (magic zone)
 - [x] Offline simulation framework (`ml/experiments/sim`): rule/LLM/claude-code judges, two-policy comparison, results persisted to `sim_runs` + `sim_events`
- [x] **ε-greedy v1 promoted to active policy** (ADR-0007) — +10.7% mean reward vs LinUCB in offline sim
 - [x] **Web Push** (VAPID): SW, subscribe/unsubscribe API, "notify me" button on tip page
 - [x] Shadow-policy registry: run N shadow policies per request, log picks without serving them (#56)
+- [x] NATS JetStream bridge — durable `signals.>` and `feedback.>` streams; in-process bus stays the source of truth, every publish bridges out (#21, shipped)
+- [x] Per-user profile features (completion rate, dismiss rate, dwell, preferred hour, tip volume) — event-driven, JIT invalidation (#81)
 - [ ] Quiet-hours + dedupe for push delivery
 - [ ] Delayed rewards: tasks completed directly in Todoist (requires webhook from Todoist)
- [x] NATS JetStream bridge — durable `signals.>` and `feedback.>` streams; in-process bus stays the source of truth, every publish bridges out (#21, shipped)
+- [ ] Apple OAuth (deferred to M3)

 #### M1 add-on — Admin & ML Ops Console  *(fully shipped)*

 oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit).

-**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.**  Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Airflow) runs as **separate external services** linked from the admin shell; Grafana panels are embedded.
+**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.**  Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow) runs as a **separate external service** linked from the admin shell; Grafana panels are embedded.

 | Layer | Tool | Why |
 |-------|------|-----|
@@ -159,7 +163,6 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
 | Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) |
 | Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) |
 | Model registry / experiments | **[MLflow](https://mlflow.org)** *(external — `o.alogins.net/mlflow`)* | Experiment tracking, artifact browser, model registry; own basic-auth |
-| Pipeline orchestration | **[Airflow](https://airflow.apache.org)** *(external — `o.alogins.net/airflow`)* | Batch feature + retraining DAGs; own web-auth |
 | Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth |
 | Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link |
 | AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface |
@@ -170,27 +173,25 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
 - *React-admin / Refine.dev* — strong CRUD scaffolding, but analytics/ML views feel bolted on; we'd rebuild Tremor-style dashboards ourselves
 - *Superset / Metabase as the admin surface* — excellent for BI, poor for operational **writes** (revoke, replay, promote). Plan: **adopt Superset in M4** for BI alongside batch pipelines; ship a read-only SQL widget inside admin for now

-**Build sequence (plan, not code):**
+**Build sequence:**
 1. [x] **ADR-0006** — record the framework choice + "embed, don't rebuild" rule for MLflow/Grafana
 2. [x] **Scaffold** — `apps/admin` with Next.js 15, Tailwind, Tremor; deploy behind Caddy at `admin.o.alogins.net`
 3. [x] **RBAC** — `role` column on `users`; admin-only Next.js middleware; seed first admin via `ADMIN_SEED_EMAIL` env; `admin_actions` audit-log table
 4. [x] **Overview dashboard** — DAU/WAU KPI cards, tips served, reaction breakdown, activation funnel
-5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit actions
+5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit + rebuild-profile actions
 6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS
-7. [x] **Feature store browser** — features sent to `ml/serving` per scoring call; diff across time for a user
-8. [x] **Model registry panel** — `/admin/models` links out to MLflow (`mlflow.o.alogins.net`); experiment tracking and dataset management in MLflow + Airflow
-9. [x] **MLOps hub** — `/admin/experiments` links to MLflow experiments/models and Airflow DAGs/datasets; bandit reset on Users page
-10. [x] **Recommendation log (explainability)** — per served tip: `(user, features, policy, score, feedback, latency)`; `tip_scores` table, 30-day retention
-11. [x] **Reward analytics** — reaction distribution over time; per-policy compare; slice by `hour_of_day`, `priority`, cohort
-12. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap
-13. [x] **Ops actions** — revoke token (Users page), replay signal, disable/promote shadow policy; every action audit-logged
-14. [x] **Read-only SQL runner** — SELECT-only runner against SQLite + saved queries (sunsets to Superset in M4)
-15. [x] **Health rollup** — `/admin/health` surfaces api, ml/serving, SQLite, event-bus; auto-refreshes every 15s
-16. [ ] **Docs** — `apps/admin/README.md`, runbook for common ops actions, ADR-0006 merged
+7. [x] **Features page** — features sent to `ml/serving` per scoring call; per-user profile features with freshness; diff across time
+8. [x] **Tips page** — tips served, scored, feedback reactions with policy/model breakdown
+9. [x] **Reward analytics** — reaction distribution over time; per-policy / per-model / per-prompt-version compare; slice by `hour_of_day`, `priority`, cohort
+10. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap; per-feature freshness SLA status
+11. [x] **Ops actions** — revoke token (Users page), rebuild profile, reset bandit, enable/disable shadow policies; every action audit-logged
+12. [x] **Health rollup** — `/admin/health` surfaces api, ml/serving, SQLite, event-bus, MLflow; auto-refreshes every 15s
+13. [x] **Read-only SQL runner** — SELECT-only runner against SQLite + saved queries (sunsets to Superset in M4)
+14. [x] **Offline simulation runner** — launch `ml/experiments/sim` from admin UI; track sim runs, judge, policy comparison
+15. [x] **Token-based admin auth** — `POST /api/auth/token` for Playwright/CI; `ADMIN_TOKEN` env var (#105)
+16. [x] **Docs pages** — admin documentation and runbooks inline

- [ ] Apple OAuth (deferred to M2)
-
-### Phase 2 — AI tips + multi-source signals  *(M2)*
+### Phase 2 — AI tips + multi-source signals  *(M2)* in progress
 Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multiple signal sources feed a generalized pipeline. Research-intensive milestone.

 **AI infrastructure (unblock everything else):**
@@ -198,21 +199,21 @@ Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multi
 - [ ] AI gateway — wire `ml/serving` to LiteLLM; model aliases `tip-generator` + `embedder` (#87)

 **AI tip generation pipeline:**
- [ ] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`) (#88)
+- [x] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`); skeleton implemented
 - [ ] Tip generator endpoint — `POST /generate` in `ml/serving`; LLM → N typed `TipCandidate` objects (#79)
 - [ ] `TipCandidate` shared schema — `{content, kind, source, model, prompt_version, confidence}`; update recommender pipeline (#89)
 - [ ] LLM output validation + retry — JSON schema gate, clarification retry (2×), fallback to task-based (#90)
 - [ ] Prompt versioning — `prompt_version` + `model` columns in `tip_scores`; content-hash invalidation (#91)
- [ ] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)
+- [x] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)

 **Evaluation & model selection:**
 - [ ] Model benchmark — compare qwen2.5:7b / llama3.2:3b / gemma3:4b via offline sim + LLM judge (#93)
 - [ ] LLM prompt research — persona design, context injection strategies, few-shot examples (#84)

 **Pipeline architecture:**
- [ ] Signal source abstraction — `SignalSource` interface generalizing beyond Todoist (#78)
+- [x] Signal source abstraction — `SignalSource` interface for Todoist + extensible design (#78)
 - [ ] Generalized recommendation pipeline — candidate → rank → render stages (#80)
- [ ] Feature registry + user profile builder — centralized features, persistent profiles (#81)
+- [x] Feature registry + user profile builder — centralized features, persistent profiles, event-driven invalidation (#81)
 - [ ] Tip kind system — task, advice, insight, reminder with kind-aware UI + rewards (#82)

 **Policy research:**
@@ -222,33 +223,36 @@ Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multi
 - [ ] Apple OAuth (#7)
 - [x] NATS JetStream replacing in-process bus (#21) — adapter ships in `services/api/src/events/nats.ts`; in-proc bus is the producer, JetStream is the durable mirror
 - [x] Todoist sync via events (#22) — background scheduler in `services/api/src/signals/scheduler.ts` emits `signals.task.synced` every `TODOIST_SYNC_INTERVAL_MS`; on-demand fetch remains as freshness fallback
- [ ] Event schema registry + protobuf CI gate (#54)
- [ ] Per-user freshness SLAs for features (#61)
- [ ] CI skeleton (#3), observability (#18), E2E tests (#20)
+- [x] Event schema registry + protobuf CI gate (#54) — buf lint/breaking checks on every PR
+- [x] Per-user freshness SLAs for features (#61) — context-feature (JIT) vs profile-feature (batched) spec in ADR-0011; CONTEXT_FEATURES in ml/features/context.py
+- [x] Observability (#18) — structured logs via pino, W3C trace IDs, Sentry hooks, trace correlation end-to-end
+- [ ] CI skeleton (#3), E2E tests (#20)

-**Bugs (fix before new features):**
- [ ] TipFeedback type mismatch (#73)
- [ ] Todoist token refresh (#74)
- [ ] Reward fire-and-forget (#75)
- [ ] Data retention purge (#76)
- [ ] Port mismatch (#77)
+**Bugs & UX (fix before new features):**
+- [x] TipFeedback type mismatch (#73)
+- [x] Todoist token refresh (#74) — OAuth token auto-refresh on 401
+- [x] Reward fire-and-forget (#75) — retry logic + logging
+- [x] Data retention purge (#76) — daily purge of 30-day-old tip_scores/tip_feedback
+- [x] Port mismatch (#77) — fixed in docker-compose + env var config
+- [ ] UX refinements (#100–102) — "done/snooze/dismiss" feedback only, config page UI, settings gear button

 ### Phase 3 — Native mobile  *(M3)*
 - [ ] iOS app (SwiftUI) with APNs push
 - [ ] Android app (Compose) with FCM push
 - [ ] `notifier` gains APNs + FCM channels, per-device rate limits
 - [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004)
- [ ] Consolidate MLflow + Airflow behind shared OIDC (SSO for all internal services)
+- [ ] Consolidate MLflow behind shared OIDC (SSO for all internal services)
 - [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold

 ### Phase 4 — MLOps at scale  *(M4)*
- [x] Airflow + MLflow deployed as external services (`mlops` compose profile); each with own auth
- [ ] Write first retraining DAG (Airflow) + first MLflow experiment logging from `ml/serving`
- [ ] Feature-to-prompt pipeline — nightly Airflow DAG materializes context for LLM; cuts inline latency (#94)
+- [x] MLflow deployed as external service (`mlops` compose profile); own auth; health check integrated
+- [ ] Write first retraining pipeline + first MLflow experiment logging from `ml/serving` + JetStream consumers (#98)
+- [ ] Feature-to-prompt pipeline — nightly batch job materializes context for LLM; cuts inline latency (#94)
 - [ ] Prompt optimization loop — sim A/B → MLflow experiment → human-approved promotion (#95)
 - [ ] LLM fine-tuning — tip reactions as training signal; LoRA on base model; MLflow tracks runs (#96)
 - [ ] Embedding-based task clustering — `nomic-embed-text` for dedup + user pattern features (#97)
- [ ] Consolidate MLflow + Airflow auth into shared OIDC provider (tracked as M3 issue #85)
+- [ ] Modular-monolith packaging + import-boundary lint (#47)
+- [ ] Consolidate MLflow auth into shared OIDC provider (tracked as M3 issue #85)
 - [ ] Shadow → A/B → launch pipeline as first-class in MLflow
 - [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B
 - [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks