chore: remove Airflow completely from the stack

Drop all four Airflow containers (db, init, webserver, scheduler) from the mlops compose profile, leaving MLflow as the sole mlops service. Remove AIRFLOW_* env vars, config fields, health-check entries, DAG trigger code in admin/bench routes, the airflow_dag_run_id schema column, Airflow nav links and DAG-run links in the admin UI, the two Airflow DAG files (bench_dag.py, sim_dag.py), and all related docs/ADR references. Simulations now run exclusively via the subprocess path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-03 16:38:46 +00:00
parent ce1c8bde57
commit f8d66aa01f
27 changed files with 663 additions and 719 deletions
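
With the `bench_collect` DAG gone, the bench steps documented in the removed AIRFLOW.md (collect, export for judging, apply scores, compare) can still be chained by hand. The following is a minimal sketch of one way to do that with `subprocess`, not the repository's actual subprocess path. The helper names `pre_judge_commands`, `post_judge_commands`, and `run_all` are hypothetical, the script location is inferred from the paths in this diff, and passing an output path to `--export` is an assumption. Only `--export` and `--apply <file>` themselves appear in the removed docs.

```python
import subprocess
import sys
from pathlib import Path

# Assumed script location, inferred from the file paths in this diff.
BENCH_DIR = Path("ml/experiments/bench")


def pre_judge_commands(export_path: str) -> list[list[str]]:
    """Build argv lists for the steps that run before manual scoring."""
    py = sys.executable
    return [
        [py, str(BENCH_DIR / "collect.py")],
        # Assumption: --export takes the output file path as its value.
        [py, str(BENCH_DIR / "judge_cli.py"), "--export", export_path],
    ]


def post_judge_commands(scored_path: str) -> list[list[str]]:
    """Build argv lists for the steps that run after the file is scored."""
    py = sys.executable
    return [
        [py, str(BENCH_DIR / "judge_cli.py"), "--apply", scored_path],
        [py, str(BENCH_DIR / "compare.py")],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order; check=True raises on the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

The two phases are kept separate because scoring happens by hand between export and apply. A caller would run `run_all(pre_judge_commands("pending.json"))`, score the file, then run `run_all(post_judge_commands("pending.json"))`.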
--- a/ml/experiments/bench/AIRFLOW.md
+++ b/ml/experiments/bench/AIRFLOW.md
@@ -1,90 +0,0 @@
-# Airflow Integration — `bench_collect` DAG
-
-The benchmark harness integrates with Airflow as a DAG (`ml/pipelines/bench_dag.py`)
-triggered on-demand from the admin UI or the CLI.
-
-## DAG Structure
-
-Three linked tasks:
-
-1. **`collect`** — `collect.py` generates candidates per (model × prompt × scenario) cell,
-   logs MLflow runs with `judge_pending=true`. Rejects models >4B, uses `keep_alive=0`
-   for RAM safety.
-
-2. **`export_for_judge`** — `judge_cli.py --export` pulls pending runs into a single
-   JSON file for Claude Code to score per the rubric. XCom-pushes the path so the
-   next task can find it.
-
-3. **`compare`** — `compare.py` aggregates scores by (model, prompt) cell and
-   generates the leaderboard ranked by composite score.
-
-## Triggering from the CLI
-
-```bash
-# Minimal: use all defaults
-airflow dags trigger bench_collect
-
-# Custom config: specify models, prompts, scenario count
-airflow dags trigger bench_collect --conf '{
-  "models": "qwen2.5:0.5b,qwen2.5:1.5b",
-  "prompts": "v1,v2-mentor",
-  "n_tips": 5,
-  "n_scenarios": 2,
-  "temperature": 0.7,
-  "experiment": "tip-bench-custom"
-}'
-```
-
-## Triggering from the Admin UI
-
-The API exposes:
-
-```
-POST /api/bench/run  { config object }
-```
-
-Admin UI → Benchmark panel → "Run Collection" button → form dialog fills config →
-POST to `/api/bench/run` → DAG triggered.
-
-## Configuration Keys
-
-| Key | Type | Default | Description |
-|-----|------|---------|-------------|
-| `models` | str | `qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b` | comma-separated Ollama tags |
-| `prompts` | str | `v1,v2-mentor,v3-few-shot` | comma-separated prompt versions |
-| `n_tips` | int | 5 | candidates to generate per scenario |
-| `n_scenarios` | int | 0 | cap scenario count (0 = all 8) |
-| `temperature` | float | 0.7 | LLM generation temperature |
-| `experiment` | str | `tip-bench-auto` | MLflow experiment name |
-| `max_model_b` | float | 4.0 | reject models larger than this (in billions) |
-| `ollama_url` | str | `http://localhost:11434` | Ollama endpoint |
-| `mlflow_url` | str | `$MLFLOW_TRACKING_URI` or `http://localhost:5000` | MLflow tracking URI |
-
-## Human-in-the-Loop Judge
-
-After `collect` finishes, `export_for_judge` produces a JSON file with all pending
-runs. The Claude Code session:
-
-1. Reads the file
-2. Scores each candidate per the rubric (relevance/actionability/tone 1–5)
-3. Runs `judge_cli.py --apply /path/to/file.json` to write scores back to MLflow
-
-Then `compare` generates the leaderboard.
-
-**Future enhancement:** Add a webhook or admin UI button to trigger the judge step
-so the entire pipeline is end-to-end in Airflow, not requiring manual Claude Code
-intervention.
-
-## Monitoring
-
-- **Airflow UI**: `http://localhost:8080` → DAGs → `bench_collect` → graph view
-- **MLflow UI**: `http://localhost:5000/mlflow` → experiments → `tip-bench-*`
-- **Admin API**: `GET /api/bench/leaderboard/tip-bench-auto` → JSON leaderboard
-
-## Future: Admin UI Panel
-
-`apps/admin/src/components/BenchPanel.tsx` (TBD):
-- List experiments
-- Trigger DAG with form (models, prompts, scenario count, temperature)
-- Display current DAG run status
-- Show leaderboard once `compare` completes
--- a/ml/experiments/bench/README.md
+++ b/ml/experiments/bench/README.md
@@ -77,13 +77,9 @@ keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt`
 (tag fallback because the MLflow server uses a file:// artifact backend
 not accessible via REST from the host).

-## Integrating with Airflow (#95)
+## Running standalone

-A future DAG `ml/pipelines/prompt_ab_eval.py` will wrap `collect.py`
-exactly as shown in the quick-start, triggered on-demand from the admin
-UI or manually. The results feed into the admin leaderboard view.
-
-For now, the pipeline is runnable standalone on any machine with:
+The pipeline runs on any machine with:
 - Ollama models ≤4B
 - MLflow tracking server
 - Python 3.10+
--- a/ml/experiments/bench/mlflow_client.py
+++ b/ml/experiments/bench/mlflow_client.py
@@ -10,8 +10,7 @@ Why not the official ``mlflow`` SDK? Two reasons specific to the oO setup:
   Pulling a 200MB SDK transitively for that is excess weight.

 All calls are synchronous httpx with explicit ``Host`` so the script can
-run from the host shell, from inside docker, or from Airflow workers
-without further config.
+run from the host shell or from inside docker without further config.
 """

 from __future__ import annotations