feat(airflow): integrate bench harness into bench_collect DAG

New DAG (`ml/pipelines/bench_dag.py`) with three linked tasks:
1. collect.py - generates candidates, logs to MLflow
2. export_for_judge - exports pending runs for Claude Code scoring
3. compare - generates leaderboard by (model, prompt) cell

Config via dag_run.conf supports all collect.py options (models, prompts,
n_tips, n_scenarios, temperature, experiment name, max_model_b).

New admin API endpoints (`services/api/src/routes/bench.ts`):
- GET /api/bench/experiments - list tip-bench-* experiments
- POST /api/bench/run - trigger DAG with custom config
- GET /api/bench/runs/:experiment - list runs in experiment
- GET /api/bench/leaderboard/:experiment - leaderboard by (model, prompt)

All endpoints require admin auth.

Human judge (Claude Code) scores are applied manually post-export;
future enhancement: add webhook to DAG.

Admin UI can now trigger and monitor benchmarks from a dashboard panel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:54:30 +00:00
parent 556019b060
commit 0474ad4deb
4 changed files with 494 additions and 0 deletions
--- a/ml/experiments/bench/AIRFLOW.md
+++ b/ml/experiments/bench/AIRFLOW.md
@@ -0,0 +1,90 @@
+# Airflow Integration — `bench_collect` DAG
+
+The benchmark harness integrates with Airflow as a DAG (`ml/pipelines/bench_dag.py`)
+triggered on-demand from the admin UI or the CLI.
+
+## DAG Structure
+
+Three linked tasks:
+
+1. **`collect`** — `collect.py` generates candidates per (model × prompt × scenario) cell,
+   then logs MLflow runs with `judge_pending=true`. Rejects models above `max_model_b`
+   (default 4B) and sets `keep_alive=0` so Ollama unloads each model after use, limiting RAM use.
+
+2. **`export_for_judge`** — `judge_cli.py --export` pulls pending runs into a single
+   JSON file for Claude Code to score per the rubric. XCom-pushes the path so the
+   next task can find it.
+
+3. **`compare`** — `compare.py` aggregates scores by (model, prompt) cell and
+   generates the leaderboard ranked by composite score.
+
+## Triggering from the CLI
+
+```bash
+# Minimal: use all defaults
+airflow dags trigger bench_collect
+
+# Custom config: specify models, prompts, scenario count
+airflow dags trigger bench_collect --conf '{
+  "models": "qwen2.5:0.5b,qwen2.5:1.5b",
+  "prompts": "v1,v2-mentor",
+  "n_tips": 5,
+  "n_scenarios": 2,
+  "temperature": 0.7,
+  "experiment": "tip-bench-custom"
+}'
+```
+
+## Triggering from the Admin UI
+
+The API exposes:
+
+```
+POST /api/bench/run  { config object }
+```
+
+Admin UI → Benchmark panel → "Run Collection" button → form dialog collects the config →
+POST to `/api/bench/run` → DAG run is triggered.
+
+## Configuration Keys
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| `models` | str | `qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b` | comma-separated Ollama tags |
+| `prompts` | str | `v1,v2-mentor,v3-few-shot` | comma-separated prompt versions |
+| `n_tips` | int | 5 | candidates to generate per scenario |
+| `n_scenarios` | int | 0 | cap scenario count (0 = all 8) |
+| `temperature` | float | 0.7 | LLM generation temperature |
+| `experiment` | str | `tip-bench-auto` | MLflow experiment name |
+| `max_model_b` | float | 4.0 | reject models larger than this (in billions) |
+| `ollama_url` | str | `http://localhost:11434` | Ollama endpoint |
+| `mlflow_url` | str | `$MLFLOW_TRACKING_URI` or `http://localhost:5000` | MLflow tracking URI |
+
+## Human-in-the-Loop Judge
+
+After `collect` finishes, `export_for_judge` produces a JSON file containing all pending
+runs. A Claude Code session then:
+
+1. Reads the file
+2. Scores each candidate per the rubric (relevance/actionability/tone 1–5)
+3. Runs `judge_cli.py --apply /path/to/file.json` to write scores back to MLflow
+
+Once the scores are applied, `compare` generates the leaderboard.
+
+**Future enhancement:** Add a webhook or admin UI button to trigger the judge step
+so the entire pipeline runs end-to-end in Airflow without manual Claude Code
+intervention.
+
+## Monitoring
+
+- **Airflow UI**: `http://localhost:8080` → DAGs → `bench_collect` → graph view
+- **MLflow UI**: `http://localhost:5000/mlflow` → experiments → `tip-bench-*`
+- **Admin API**: `GET /api/bench/leaderboard/tip-bench-auto` → JSON leaderboard
+
+## Future: Admin UI Panel
+
+`apps/admin/src/components/BenchPanel.tsx` (TBD):
+- List experiments
+- Trigger DAG with form (models, prompts, scenario count, temperature)
+- Display current DAG run status
+- Show leaderboard once `compare` completes
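
Note on the Configuration Keys table in `AIRFLOW.md`: the way a `dag_run.conf` payload might be overlaid on the documented defaults, and how the `max_model_b` guard might be applied, can be sketched roughly as follows. This is a minimal illustration, not the code in `bench_dag.py` or `collect.py`. The helper names (`resolve_conf`, `model_size_b`, `oversized_models`) are hypothetical, and deriving a model's size from the `<number>b` suffix of its Ollama tag is an assumption; the real harness may determine size differently.

```python
import os
import re

# Defaults copied from the Configuration Keys table in AIRFLOW.md.
DEFAULTS = {
    "models": "qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b",
    "prompts": "v1,v2-mentor,v3-few-shot",
    "n_tips": 5,
    "n_scenarios": 0,  # 0 means "use all scenarios"
    "temperature": 0.7,
    "experiment": "tip-bench-auto",
    "max_model_b": 4.0,
    "ollama_url": "http://localhost:11434",
    "mlflow_url": os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
}


def resolve_conf(conf):
    """Overlay a dag_run.conf-style dict on the defaults (hypothetical helper)."""
    conf = conf or {}
    unknown = sorted(set(conf) - set(DEFAULTS))
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    return {**DEFAULTS, **conf}


def model_size_b(tag):
    """Guess the size in billions from a tag such as 'llama3.2:3b'.

    Assumed heuristic: returns None when the tag has no '<number>b' suffix.
    """
    match = re.search(r":(\d+(?:\.\d+)?)b$", tag.strip().lower())
    return float(match.group(1)) if match else None


def oversized_models(resolved):
    """List the models whose guessed size exceeds max_model_b."""
    tags = [t.strip() for t in resolved["models"].split(",") if t.strip()]
    limit = resolved["max_model_b"]
    too_big = []
    for tag in tags:
        size = model_size_b(tag)
        if size is not None and size > limit:
            too_big.append(tag)
    return too_big
```

Calling `resolve_conf({"n_scenarios": 2})` keeps every other documented default, and `oversized_models` flags any tag above the limit so it can be rejected before generation starts. Rejecting unknown keys up front is a design choice made for this sketch; silently ignoring them would also be defensible.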