chore: remove Airflow completely from the stack
Drop all four Airflow containers (db, init, webserver, scheduler) from the mlops compose profile, leaving MLflow as the sole mlops service. Remove AIRFLOW_* env vars, config fields, health-check entries, DAG trigger code in admin/bench routes, the airflow_dag_run_id schema column, Airflow nav links and DAG-run links in the admin UI, the two Airflow DAG files (bench_dag.py, sim_dag.py), and all related docs/ADR references. Simulations now run exclusively via the subprocess path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
88
README.md
88
README.md
@@ -104,13 +104,15 @@ User signals ──▶ Context assembler ──▶ LiteLLM ──▶ Ollam
|
||||
|
||||
**Why Ollama first:** Tips contain personal context. Local inference means no user data leaves the host for the inference path. Cloud models (Anthropic, OpenAI) are opt-in fallbacks for evaluation and simulation only, gated behind `ANTHROPIC_API_KEY`.
|
||||
|
||||
### Models (planned)
|
||||
### Models (planned; routes through LiteLLM)
|
||||
|
||||
| Alias | Model | Task |
|
||||
|-------|-------|------|
|
||||
| `tip-generator` | qwen2.5:7b (default) | Generate typed tip candidates from user context |
|
||||
| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup |
|
||||
| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B |
|
||||
| `tip-generator` | qwen2.5:1.5b (default) | Generate typed tip candidates from user context; local-first via Ollama |
|
||||
| `embedder` | nomic-embed-text | Task clustering, semantic similarity for dedup; local via Ollama |
|
||||
| `judge` | claude-haiku-4-5 (cloud, eval-only) | Offline sim judge; rates tip quality for A/B (requires `ANTHROPIC_API_KEY`) |
|
||||
|
||||
All model calls route through **LiteLLM** at `llm.alogins.net` (or `LITELLM_URL` env var) using model aliases. This decouples tip generation from model selection — swap the backend model in LiteLLM config without code changes. See ADR-0008.
|
||||
|
||||
---
|
||||
|
||||
@@ -134,22 +136,24 @@ Goal: tips are picked, not drawn from a hat — and they arrive at the right mom
|
||||
- [x] Event bus scaffold: typed in-process EventEmitter with 500-event ring buffer; subjects match future NATS JetStream — swap is mechanical
|
||||
- [x] Todoist sync emits `signals.task.synced`; tip served/feedback emit `signals.tip.*`
|
||||
- [x] Features extracted per task: `is_overdue`, `task_age_days`, `priority`; context: `hour_of_day`, `day_of_week`
|
||||
- [x] `ml/serving` LinUCB (d=5) + **ε-greedy v1** (d=7, ε=0.10, day-of-week sin/cos features); per-user state persisted to disk
|
||||
- [x] **ε-greedy v1** (d=7, ε=0.10, day-of-week sin/cos features); per-user state persisted to disk
|
||||
- [x] **ε-greedy v2** (d=12, profile features: completion rate, dismiss rate, dwell, preferred hour, tip volume) in shadow; promoted to active policy (ADR-0012)
|
||||
- [x] `RemotePolicy` in recommender: calls ml/serving, falls back to RandomPolicy on timeout/error; logs explainability to `tip_scores`
|
||||
- [x] Feedback loop: dwell-time inferred reward (`inferReward`) → online model update; `done` in 15 s–2 min = +1.0 (magic zone)
|
||||
- [x] Offline simulation framework (`ml/experiments/sim`): rule/LLM/claude-code judges, two-policy comparison, results persisted to `sim_runs` + `sim_events`
|
||||
- [x] **ε-greedy v1 promoted to active policy** (ADR-0007) — +10.7% mean reward vs LinUCB in offline sim
|
||||
- [x] **Web Push** (VAPID): SW, subscribe/unsubscribe API, "notify me" button on tip page
|
||||
- [x] Shadow-policy registry: run N shadow policies per request, log picks without serving them (#56)
|
||||
- [x] NATS JetStream bridge — durable `signals.>` and `feedback.>` streams; in-process bus stays the source of truth, every publish bridges out (#21, shipped)
|
||||
- [x] Per-user profile features (completion rate, dismiss rate, dwell, preferred hour, tip volume) — event-driven, JIT invalidation (#81)
|
||||
- [ ] Quiet-hours + dedupe for push delivery
|
||||
- [ ] Delayed rewards: tasks completed directly in Todoist (requires webhook from Todoist)
|
||||
- [x] NATS JetStream bridge — durable `signals.>` and `feedback.>` streams; in-process bus stays the source of truth, every publish bridges out (#21, shipped)
|
||||
- [ ] Apple OAuth (deferred to M3)
|
||||
|
||||
#### M1 add-on — Admin & ML Ops Console *(fully shipped)*
|
||||
|
||||
oO is ML-heavy. Without a cockpit, every model change ships blind. This console is the team's single pane for users, signals, features, models, experiments, and tip outcomes — with the ability to *act* on them (revoke a token, replay an event, promote a model, reset a bandit).
|
||||
|
||||
**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.** Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow, Airflow) runs as **separate external services** linked from the admin shell; Grafana panels are embedded.
|
||||
**Framework pick — `apps/admin` on Next.js 15 + Tremor + shadcn/ui.** Analytics-first UI for an analytics-first product, stays on our existing TS/React/Tailwind stack, reuses `packages/shared-types`, `sdk-js`, and the Auth.js session. Specialized ML tooling (MLflow) runs as a **separate external service** linked from the admin shell; Grafana panels are embedded.
|
||||
|
||||
| Layer | Tool | Why |
|
||||
|-------|------|-----|
|
||||
@@ -159,7 +163,6 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
|
||||
| Heavy grids | **[TanStack Table v8](https://tanstack.com/table)** | Sortable / paginated / virtualized tables (events, users, tips) |
|
||||
| Extra charts | **[Recharts](https://recharts.org)** / **[visx](https://airbnb.io/visx)** | Fallbacks where Tremor falls short (e.g. force graphs, Sankey) |
|
||||
| Model registry / experiments | **[MLflow](https://mlflow.org)** *(external — `o.alogins.net/mlflow`)* | Experiment tracking, artifact browser, model registry; own basic-auth |
|
||||
| Pipeline orchestration | **[Airflow](https://airflow.apache.org)** *(external — `o.alogins.net/airflow`)* | Batch feature + retraining DAGs; own web-auth |
|
||||
| Infra metrics | **[Grafana](https://grafana.com)** *(embedded panels)* | One ops source of truth |
|
||||
| Ad-hoc analysis | **[Marimo](https://marimo.io)** reactive notebooks | Python-native for the ML side; launch-out link |
|
||||
| AuthZ | `profile.role='admin'` + Next.js middleware | Reuses existing session; no new auth surface |
|
||||
@@ -170,27 +173,25 @@ oO is ML-heavy. Without a cockpit, every model change ships blind. This console
|
||||
- *React-admin / Refine.dev* — strong CRUD scaffolding, but analytics/ML views feel bolted on; we'd rebuild Tremor-style dashboards ourselves
|
||||
- *Superset / Metabase as the admin surface* — excellent for BI, poor for operational **writes** (revoke, replay, promote). Plan: **adopt Superset in M4** for BI alongside batch pipelines; ship a read-only SQL widget inside admin for now
|
||||
|
||||
**Build sequence (plan, not code):**
|
||||
**Build sequence:**
|
||||
1. [x] **ADR-0006** — record the framework choice + "embed, don't rebuild" rule for MLflow/Grafana
|
||||
2. [x] **Scaffold** — `apps/admin` with Next.js 15, Tailwind, Tremor; deploy behind Caddy at `admin.o.alogins.net`
|
||||
3. [x] **RBAC** — `role` column on `users`; admin-only Next.js middleware; seed first admin via `ADMIN_SEED_EMAIL` env; `admin_actions` audit-log table
|
||||
4. [x] **Overview dashboard** — DAU/WAU KPI cards, tips served, reaction breakdown, activation funnel
|
||||
5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit actions
|
||||
5. [x] **User explorer** — list + detail page: identity, consents, integrations, last tip, reward history; revoke-integration + reset-bandit + rebuild-profile actions
|
||||
6. [x] **Event stream viewer** — live tail of `signals.*` with filters by subject/user/time; same UI when the bus swaps to NATS
|
||||
7. [x] **Feature store browser** — features sent to `ml/serving` per scoring call; diff across time for a user
|
||||
8. [x] **Model registry panel** — `/admin/models` links out to MLflow (`mlflow.o.alogins.net`); experiment tracking and dataset management in MLflow + Airflow
|
||||
9. [x] **MLOps hub** — `/admin/experiments` links to MLflow experiments/models and Airflow DAGs/datasets; bandit reset on Users page
|
||||
10. [x] **Recommendation log (explainability)** — per served tip: `(user, features, policy, score, feedback, latency)`; `tip_scores` table, 30-day retention
|
||||
11. [x] **Reward analytics** — reaction distribution over time; per-policy compare; slice by `hour_of_day`, `priority`, cohort
|
||||
12. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap
|
||||
13. [x] **Ops actions** — revoke token (Users page), replay signal, disable/promote shadow policy; every action audit-logged
|
||||
14. [x] **Read-only SQL runner** — SELECT-only runner against SQLite + saved queries (sunsets to Superset in M4)
|
||||
15. [x] **Health rollup** — `/admin/health` surfaces api, ml/serving, SQLite, event-bus; auto-refreshes every 15s
|
||||
16. [ ] **Docs** — `apps/admin/README.md`, runbook for common ops actions, ADR-0006 merged
|
||||
7. [x] **Features page** — features sent to `ml/serving` per scoring call; per-user profile features with freshness; diff across time
|
||||
8. [x] **Tips page** — tips served, scored, feedback reactions with policy/model breakdown
|
||||
9. [x] **Reward analytics** — reaction distribution over time; per-policy / per-model / per-prompt-version compare; slice by `hour_of_day`, `priority`, cohort
|
||||
10. [x] **Data quality widget** — missing-feature rate, stale-token rate, daily completeness heatmap; per-feature freshness SLA status
|
||||
11. [x] **Ops actions** — revoke token (Users page), rebuild profile, reset bandit, enable/disable shadow policies; every action audit-logged
|
||||
12. [x] **Health rollup** — `/admin/health` surfaces api, ml/serving, SQLite, event-bus, MLflow; auto-refreshes every 15s
|
||||
13. [x] **Read-only SQL runner** — SELECT-only runner against SQLite + saved queries (sunsets to Superset in M4)
|
||||
14. [x] **Offline simulation runner** — launch `ml/experiments/sim` from admin UI; track sim runs, judge, policy comparison
|
||||
15. [x] **Token-based admin auth** — `POST /api/auth/token` for Playwright/CI; `ADMIN_TOKEN` env var (#105)
|
||||
16. [x] **Docs pages** — admin documentation and runbooks inline
|
||||
|
||||
- [ ] Apple OAuth (deferred to M2)
|
||||
|
||||
### Phase 2 — AI tips + multi-source signals *(M2)*
|
||||
### Phase 2 — AI tips + multi-source signals *(M2)* in progress
|
||||
Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multiple signal sources feed a generalized pipeline. Research-intensive milestone.
|
||||
|
||||
**AI infrastructure (unblock everything else):**
|
||||
@@ -198,21 +199,21 @@ Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multi
|
||||
- [ ] AI gateway — wire `ml/serving` to LiteLLM; model aliases `tip-generator` + `embedder` (#87)
|
||||
|
||||
**AI tip generation pipeline:**
|
||||
- [ ] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`) (#88)
|
||||
- [x] Context assembler — user signals + feature store → structured prompt context (`ml/features/context.py`); skeleton implemented
|
||||
- [ ] Tip generator endpoint — `POST /generate` in `ml/serving`; LLM → N typed `TipCandidate` objects (#79)
|
||||
- [ ] `TipCandidate` shared schema — `{content, kind, source, model, prompt_version, confidence}`; update recommender pipeline (#89)
|
||||
- [ ] LLM output validation + retry — JSON schema gate, clarification retry (2×), fallback to task-based (#90)
|
||||
- [ ] Prompt versioning — `prompt_version` + `model` columns in `tip_scores`; content-hash invalidation (#91)
|
||||
- [ ] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)
|
||||
- [x] LLM tip quality dashboard — reaction breakdown by model / prompt_version in `/admin/reward-analytics` (#92)
|
||||
|
||||
**Evaluation & model selection:**
|
||||
- [ ] Model benchmark — compare qwen2.5:7b / llama3.2:3b / gemma3:4b via offline sim + LLM judge (#93)
|
||||
- [ ] LLM prompt research — persona design, context injection strategies, few-shot examples (#84)
|
||||
|
||||
**Pipeline architecture:**
|
||||
- [ ] Signal source abstraction — `SignalSource` interface generalizing beyond Todoist (#78)
|
||||
- [x] Signal source abstraction — `SignalSource` interface for Todoist + extensible design (#78)
|
||||
- [ ] Generalized recommendation pipeline — candidate → rank → render stages (#80)
|
||||
- [ ] Feature registry + user profile builder — centralized features, persistent profiles (#81)
|
||||
- [x] Feature registry + user profile builder — centralized features, persistent profiles, event-driven invalidation (#81)
|
||||
- [ ] Tip kind system — task, advice, insight, reminder with kind-aware UI + rewards (#82)
|
||||
|
||||
**Policy research:**
|
||||
@@ -222,33 +223,36 @@ Goal: tips are AI-generated from user context, not just raw Todoist tasks. Multi
|
||||
- [ ] Apple OAuth (#7)
|
||||
- [x] NATS JetStream replacing in-process bus (#21) — adapter ships in `services/api/src/events/nats.ts`; in-proc bus is the producer, JetStream is the durable mirror
|
||||
- [x] Todoist sync via events (#22) — background scheduler in `services/api/src/signals/scheduler.ts` emits `signals.task.synced` every `TODOIST_SYNC_INTERVAL_MS`; on-demand fetch remains as freshness fallback
|
||||
- [ ] Event schema registry + protobuf CI gate (#54)
|
||||
- [ ] Per-user freshness SLAs for features (#61)
|
||||
- [ ] CI skeleton (#3), observability (#18), E2E tests (#20)
|
||||
- [x] Event schema registry + protobuf CI gate (#54) — buf lint/breaking checks on every PR
|
||||
- [x] Per-user freshness SLAs for features (#61) — context-feature (JIT) vs profile-feature (batched) spec in ADR-0011; CONTEXT_FEATURES in ml/features/context.py
|
||||
- [x] Observability (#18) — structured logs via pino, W3C trace IDs, Sentry hooks, trace correlation end-to-end
|
||||
- [ ] CI skeleton (#3), E2E tests (#20)
|
||||
|
||||
**Bugs (fix before new features):**
|
||||
- [ ] TipFeedback type mismatch (#73)
|
||||
- [ ] Todoist token refresh (#74)
|
||||
- [ ] Reward fire-and-forget (#75)
|
||||
- [ ] Data retention purge (#76)
|
||||
- [ ] Port mismatch (#77)
|
||||
**Bugs & UX (fix before new features):**
|
||||
- [x] TipFeedback type mismatch (#73)
|
||||
- [x] Todoist token refresh (#74) — OAuth token auto-refresh on 401
|
||||
- [x] Reward fire-and-forget (#75) — retry logic + logging
|
||||
- [x] Data retention purge (#76) — daily purge of 30-day-old tip_scores/tip_feedback
|
||||
- [x] Port mismatch (#77) — fixed in docker-compose + env var config
|
||||
- [ ] UX refinements (#100–102) — "done/snooze/dismiss" feedback only, config page UI, settings gear button
|
||||
|
||||
### Phase 3 — Native mobile *(M3)*
|
||||
- [ ] iOS app (SwiftUI) with APNs push
|
||||
- [ ] Android app (Compose) with FCM push
|
||||
- [ ] `notifier` gains APNs + FCM channels, per-device rate limits
|
||||
- [ ] Migrate auth from Auth.js to dedicated OIDC provider (trigger from ADR-0004)
|
||||
- [ ] Consolidate MLflow + Airflow behind shared OIDC (SSO for all internal services)
|
||||
- [ ] Consolidate MLflow behind shared OIDC (SSO for all internal services)
|
||||
- [ ] Decide-and-deliver scheduler: per-user "is this tip worth interrupting now?" threshold
|
||||
|
||||
### Phase 4 — MLOps at scale *(M4)*
|
||||
- [x] Airflow + MLflow deployed as external services (`mlops` compose profile); each with own auth
|
||||
- [ ] Write first retraining DAG (Airflow) + first MLflow experiment logging from `ml/serving`
|
||||
- [ ] Feature-to-prompt pipeline — nightly Airflow DAG materializes context for LLM; cuts inline latency (#94)
|
||||
- [x] MLflow deployed as external service (`mlops` compose profile); own auth; health check integrated
|
||||
- [ ] Write first retraining pipeline + first MLflow experiment logging from `ml/serving` + JetStream consumers (#98)
|
||||
- [ ] Feature-to-prompt pipeline — nightly batch job materializes context for LLM; cuts inline latency (#94)
|
||||
- [ ] Prompt optimization loop — sim A/B → MLflow experiment → human-approved promotion (#95)
|
||||
- [ ] LLM fine-tuning — tip reactions as training signal; LoRA on base model; MLflow tracks runs (#96)
|
||||
- [ ] Embedding-based task clustering — `nomic-embed-text` for dedup + user pattern features (#97)
|
||||
- [ ] Consolidate MLflow + Airflow auth into shared OIDC provider (tracked as M3 issue #85)
|
||||
- [ ] Modular-monolith packaging + import-boundary lint (#47)
|
||||
- [ ] Consolidate MLflow auth into shared OIDC provider (tracked as M3 issue #85)
|
||||
- [ ] Shadow → A/B → launch pipeline as first-class in MLflow
|
||||
- [ ] Online experiments framework: deterministic assignment + bandit policies alongside fixed-split A/B
|
||||
- [ ] Cross-user collaborative features (opt-in only); cohort slicing; fairness checks
|
||||
|
||||
Reference in New Issue
Block a user