chore: remove Airflow completely from the stack

Drop all four Airflow containers (db, init, webserver, scheduler) from the
mlops compose profile, leaving MLflow as the sole mlops service. Remove
AIRFLOW_* env vars, config fields, health-check entries, DAG trigger code
in admin/bench routes, the airflow_dag_run_id schema column, Airflow nav
links and DAG-run links in the admin UI, the two Airflow DAG files
(bench_dag.py, sim_dag.py), and all related docs/ADR references.
Simulations now run exclusively via the subprocess path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-03 16:38:46 +00:00
parent ce1c8bde57
commit f8d66aa01f
27 changed files with 663 additions and 719 deletions

@@ -33,11 +33,10 @@ Same stack as `apps/web`. Reuses `packages/shared-types`, the Auth.js session co
Specialized MLOps tooling runs as **separate external services** with their own auth, linked from the admin shell — not embedded or reimplemented:
- **MLflow** → `https://o.alogins.net/mlflow` — experiment tracking, model registry, artifact browser; own basic-auth for now; see M3 for SSO consolidation
-- **Airflow** → `https://o.alogins.net/airflow` — batch pipeline orchestration, dataset management; own web-auth for now
- **Grafana panels** → `/admin/infra` (iframed panels) — infra metrics
- **Marimo notebooks** → launch-out link from admin
-The admin shell links to these services; clicking them opens a new tab. The `/experiments` and `/models` admin pages are hub pages with direct links to the relevant MLflow/Airflow views.
+The admin shell links to these services; clicking them opens a new tab.
### AuthZ
@@ -56,7 +55,7 @@ The admin shell links to these services; clicking them opens a new tab. The `/ex
- One more Next.js app in the monorepo. Build/dev added to Turborepo.
- Tremor + shadcn/ui are added as dependencies. shadcn components are copied into `apps/admin/src/components/ui/` — no runtime version coupling.
-- MLflow (`o.alogins.net/mlflow*` → port 5000) and Airflow (`o.alogins.net/airflow*` → port 8080) are path-based routes in the existing `o.alogins.net` Caddy block, started via `docker compose --profile mlops up`.
-- Each service manages its own auth (MLflow: built-in basic-auth; Airflow: built-in web UI auth). M3 will consolidate both behind the shared OIDC provider.
-- The `NEXT_PUBLIC_MLFLOW_URL` and `NEXT_PUBLIC_AIRFLOW_URL` build args in `Dockerfile.admin` default to the production URLs; override for dev builds.
+- MLflow (`o.alogins.net/mlflow*` → port 5000) is a path-based route in the existing `o.alogins.net` Caddy block, started via `docker compose --profile mlops up`.
+- MLflow manages its own auth (built-in basic-auth). M3 will consolidate behind the shared OIDC provider.
+- The `NEXT_PUBLIC_MLFLOW_URL` build arg in `Dockerfile.admin` defaults to the production URL; override for dev builds.
- `admin_actions` audit log grows unboundedly — needs a retention policy before M4.

@@ -0,0 +1,106 @@
# ADR-0013 — Multi-agent recommendation: pre-computed agent snippets + orchestrator LLM
**Status:** Accepted
**Date:** 2026-05-01
**Supersedes:** ADR-0007, ADR-0012
## Context
The ε-greedy bandit (ADR-0007, promoted to v2 in ADR-0012) was the first recommendation
policy. It served adequately during early M1 testing but carries structural problems that
become more acute as the user base grows:
- **Training signal sparsity.** The median user generates fewer than 5 reward signals per
week. Ridge regression on a 12-dimensional feature vector needs far more signal than
that to converge to a meaningful θ before the user loses interest.
- **Cold-start cost.** Every new user starts with an uninformed identity matrix. Early tips
are essentially random for the first weeks of use — precisely when first impressions
matter most.
- **Opacity.** The bandit cannot explain why it chose a tip. An orchestrator that reasons
explicitly over named agent outputs ("3 overdue tasks + peak hour approaching") is
interpretable by design.
- **Coupling of generation and selection.** The current pipeline generates candidates, then
scores them; the scoring is decoupled from the LLM reasoning. Giving the LLM the full
pre-computed context directly is a simpler and more capable design.
## Decision
Replace the RL bandit with a **multi-agent pipeline**:
### Sub-agents (async, pre-computed)
Multiple domain-specialized Python agents each analyze user state from one angle and
produce a **prompt snippet** — a short natural-language paragraph describing what they
found. They do not produce tips. They run periodically (every 15 minutes) and store
results in the new `agent_outputs` table with per-agent TTLs.
Initial agent set:
| Agent | ID | TTL |
|---|---|---|
| OverdueTaskAgent | `overdue-task` | 1h |
| MomentumAgent | `momentum` | 6h |
| TimeOfDayAgent | `time-of-day` | 15m |
| RecentPatternsAgent | `recent-patterns` | 24h |
| FocusAreaAgent | `focus-area` | 12h |
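
The per-agent TTLs in the table determine when a stored snippet stops being usable. A minimal sketch of that bookkeeping, assuming UTC timestamps; `AGENT_TTLS`, `expiry_for`, and `is_fresh` are illustrative names and not taken from `ml/agents/`:

```python
from datetime import datetime, timedelta, timezone

# TTLs copied from the agent table; the mapping itself is a hypothetical helper.
AGENT_TTLS = {
    "overdue-task": timedelta(hours=1),
    "momentum": timedelta(hours=6),
    "time-of-day": timedelta(minutes=15),
    "recent-patterns": timedelta(hours=24),
    "focus-area": timedelta(hours=12),
}


def expiry_for(agent_id: str, computed_at: datetime) -> str:
    """Return expires_at (ISO 8601) as computed_at plus the agent's TTL."""
    return (computed_at + AGENT_TTLS[agent_id]).isoformat()


def is_fresh(expires_at: str, now: datetime) -> bool:
    """An output is usable only while now is strictly before expires_at."""
    return datetime.fromisoformat(expires_at) > now


computed = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
print(expiry_for("time-of-day", computed))
```

Under these assumptions, the `time-of-day` TTL equals the 15-minute run period, so that snippet is only available if each scheduled run completes on time.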
### Orchestrator agent (real-time)
When a user requests a tip, the TypeScript recommender:
1. Fetches all non-expired `agent_outputs` rows for the user.
2. Calls `POST /recommend` on `ml/serving` with the snippet list.
3. `ml/serving` assembles a single orchestrator prompt (template `v4-orchestrator`)
that concatenates all snippets, then calls LiteLLM via the existing `tip-generator`
alias to produce one tip.
No bandit scoring. No reward delivery to an ML model. The LLM receives full context and
generates the tip in one call.
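
Step 3 can be sketched roughly as follows. The actual `v4-orchestrator` template is not shown in this ADR, so the header and footer wording, the `Snippet` type, and `build_orchestrator_prompt` are all placeholders chosen for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    agent_id: str     # e.g. "overdue-task"
    prompt_text: str  # paragraph produced by the sub-agent


def build_orchestrator_prompt(snippets: list[Snippet]) -> str:
    """Concatenate all agent snippets into a single prompt string.

    The wording here is a stand-in, not the real v4-orchestrator template.
    """
    header = "You are given observations about the user from several agents."
    footer = "Write exactly one short, actionable tip."
    # Sort by agent ID so the same inputs always yield the same prompt.
    ordered = sorted(snippets, key=lambda s: s.agent_id)
    sections = [f"[{s.agent_id}]\n{s.prompt_text.strip()}" for s in ordered]
    return "\n\n".join([header, *sections, footer])


prompt = build_orchestrator_prompt([
    Snippet("time-of-day", "Peak hour approaching."),
    Snippet("overdue-task", "3 overdue tasks."),
])
print(prompt)
```

Labelling each section with its agent ID is one way to keep the prompt traceable back to the contributing agents; the real template may structure this differently.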
### Feedback
`tipFeedback` rows are still written on every user reaction. `inferReward()` still runs
and `rewardMilli` is logged for observability and potential future supervised learning.
Reactions are not delivered to an ML endpoint.
## New data model
```sql
CREATE TABLE agent_outputs (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id),
agent_id TEXT NOT NULL, -- e.g. 'overdue-task'
prompt_text TEXT NOT NULL, -- snippet produced by the agent
signals_snapshot TEXT, -- JSON: inputs the agent consumed
computed_at TEXT NOT NULL, -- ISO 8601
expires_at TEXT NOT NULL, -- ISO 8601; equals computed_at + TTL
agent_version TEXT NOT NULL -- bump to invalidate cached outputs on logic changes
);
CREATE INDEX idx_agent_outputs_user_agent_exp
ON agent_outputs(user_id, agent_id, expires_at DESC);
```
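
A sketch of how the "fetch all non-expired rows" lookup could run against this table, assuming a SQLite-compatible store and timestamps written in one fixed UTC format (string comparison of ISO 8601 values is only chronological under that assumption). The query, `fresh_snippets`, and `demo_connection` are illustrative and not part of the commit:

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (id TEXT PRIMARY KEY);
CREATE TABLE agent_outputs (
  id TEXT PRIMARY KEY,
  user_id TEXT NOT NULL REFERENCES users(id),
  agent_id TEXT NOT NULL,
  prompt_text TEXT NOT NULL,
  signals_snapshot TEXT,
  computed_at TEXT NOT NULL,
  expires_at TEXT NOT NULL,
  agent_version TEXT NOT NULL
);
CREATE INDEX idx_agent_outputs_user_agent_exp
  ON agent_outputs(user_id, agent_id, expires_at DESC);
"""

# Newest non-expired row per agent for one user.
FRESH_SQL = """
SELECT o.agent_id, o.prompt_text
FROM agent_outputs AS o
WHERE o.user_id = ? AND o.expires_at > ?
  AND o.computed_at = (
    SELECT MAX(i.computed_at) FROM agent_outputs AS i
    WHERE i.user_id = o.user_id AND i.agent_id = o.agent_id
      AND i.expires_at > ?
  )
ORDER BY o.agent_id
"""


def fresh_snippets(conn: sqlite3.Connection, user_id: str, now_iso: str):
    """Return (agent_id, prompt_text) pairs that have not expired at now_iso."""
    return conn.execute(FRESH_SQL, (user_id, now_iso, now_iso)).fetchall()


def demo_connection() -> sqlite3.Connection:
    """In-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users (id) VALUES ('u1')")
    rows = [
        ("a1", "u1", "overdue-task", "3 overdue tasks.", None,
         "2026-05-01T12:00:00Z", "2026-05-01T13:00:00Z", "1"),
        ("a2", "u1", "time-of-day", "Morning.", None,
         "2026-05-01T11:00:00Z", "2026-05-01T11:15:00Z", "1"),
        ("a3", "u1", "overdue-task", "2 overdue tasks.", None,
         "2026-05-01T12:15:00Z", "2026-05-01T13:15:00Z", "1"),
    ]
    conn.executemany("INSERT INTO agent_outputs VALUES (?,?,?,?,?,?,?,?)", rows)
    return conn


print(fresh_snippets(demo_connection(), "u1", "2026-05-01T12:30:00Z"))
```

In the sample data the `time-of-day` row has already expired at 12:30, and only the newer of the two `overdue-task` rows is returned.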
## Consequences
### Positive
- Tips are explainable: `featuresJson` in `tipScores` records which agents contributed.
- Cold-start is eliminated: the orchestrator reasons from signals immediately, no warm-up.
- Adding or removing an agent is a self-contained change in `ml/agents/`.
- Swapping LLM models remains a config change (LiteLLM alias unchanged).
### Negative / risks
- **No automatic exploration.** The bandit would discover that a user prefers certain tip
types without being told. The orchestrator only knows what the agents tell it.
Mitigation: agents can evolve to encode richer signals; offline evaluation via the
existing bench scripts remains available.
- **Scheduler dependency.** If the pre-compute job falls behind, agent outputs go
stale. Mitigation: the orchestrator falls back to a raw-signal prompt when no outputs
exist; `TimeOfDayAgent` recomputes every 15 min to stay fresh.
- **Higher per-request token cost.** The orchestrator prompt is longer than the old bandit
prompt. Mitigation: the `tip-generator` alias points to a small local model; token cost
is negligible at current scale.
## Migration sequence
See plan document in conversation context. 10 steps; each independently deployable and
rollback-able. Cutover is Step 6 (single TypeScript PR). Bandit endpoints removed in
Step 7 after 48h clean traffic.

@@ -47,7 +47,6 @@ User reactions (done / snooze / dismiss) are events too. They close the loop as
- **OpenAPI** for HTTP; TS client auto-generated; Python pydantic hand-written while consumers are few.
- **Feast** for feature store when we get there; homegrown adapter until then (Phase 1 seam).
- **MLflow** for model registry and experiment tracking; deployed at `o.alogins.net/mlflow`.
-- **Airflow** for batch pipelines; deployed at `o.alogins.net/airflow`.
- **Auth.js** embedded behind an OIDC-shaped boundary (ADR-0004). Swap to a standalone OIDC provider when mobile ships.
- **k3s** as the first step beyond docker-compose — no "compose → full k8s" cliff.