# Airflow Integration: `bench_collect` DAG

The benchmark harness integrates with Airflow as a DAG (`ml/pipelines/bench_dag.py`) triggered on-demand from the admin UI or the CLI.

## DAG Structure

Three linked tasks:

1. **`collect`**: `collect.py` generates candidates per (model × prompt × scenario) cell, logs MLflow runs with `judge_pending=true`. Rejects models >4B, uses `keep_alive=0` for RAM safety.
2. **`export_for_judge`**: `judge_cli.py --export` pulls pending runs into a single JSON file for Claude Code to score per the rubric. XCom-pushes the path so the next task can find it.
3. **`compare`**: `compare.py` aggregates scores by (model, prompt) cell and generates the leaderboard ranked by composite score.

## Triggering from the CLI

```bash
# Minimal: use all defaults
airflow dags trigger bench_collect

# Custom config: specify models, prompts, scenario count
airflow dags trigger bench_collect --conf '{
  "models": "qwen2.5:0.5b,qwen2.5:1.5b",
  "prompts": "v1,v2-mentor",
  "n_tips": 5,
  "n_scenarios": 2,
  "temperature": 0.7,
  "experiment": "tip-bench-custom"
}'
```

## Triggering from the Admin UI

The API exposes:

```
POST /api/bench/run
{ config object }
```

Admin UI → Benchmark panel → "Run Collection" button → form dialog fills config → POST to `/api/bench/run` → DAG triggered.

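
The same request can be built from a script. The sketch below only constructs the `POST /api/bench/run` request; it assumes the request body is the config object itself, encoded as JSON, and that the admin API listens on `http://localhost:3000`. Neither the base URL nor the helper name `build_bench_request` comes from this document, so adjust both to match the actual deployment.

```python
import json
import urllib.request

# Assumption: the admin API base URL is not stated in this document.
ADMIN_API_BASE = "http://localhost:3000"


def build_bench_request(
    config: dict, base_url: str = ADMIN_API_BASE
) -> urllib.request.Request:
    """Build (without sending) a POST /api/bench/run request.

    Hypothetical helper: assumes the endpoint accepts the config
    object directly as a JSON body.
    """
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/bench/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same keys as the custom CLI trigger example.
example_request = build_bench_request(
    {
        "models": "qwen2.5:0.5b,qwen2.5:1.5b",
        "prompts": "v1,v2-mentor",
        "n_tips": 5,
        "n_scenarios": 2,
        "temperature": 0.7,
        "experiment": "tip-bench-custom",
    }
)
```

Passing `example_request` to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect before a run is triggered.
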
## Configuration Keys

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `models` | str | `qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b` | comma-separated Ollama tags |
| `prompts` | str | `v1,v2-mentor,v3-few-shot` | comma-separated prompt versions |
| `n_tips` | int | 5 | candidates to generate per scenario |
| `n_scenarios` | int | 0 | cap scenario count (0 = all 8) |
| `temperature` | float | 0.7 | LLM generation temperature |
| `experiment` | str | `tip-bench-auto` | MLflow experiment name |
| `max_model_b` | float | 4.0 | reject models larger than this (in billions) |
| `ollama_url` | str | `http://localhost:11434` | Ollama endpoint |
| `mlflow_url` | str | `$MLFLOW_TRACKING_URI` or `http://localhost:5000` | MLflow tracking URI |

## Human-in-the-Loop Judge

After `collect` finishes, `export_for_judge` produces a JSON file with all pending runs. The Claude Code session:

1. Reads the file
2. Scores each candidate per the rubric (relevance/actionability/tone 1–5)
3. Runs `judge_cli.py --apply /path/to/file.json` to write scores back to MLflow

Then `compare` generates the leaderboard.

**Future enhancement:** Add a webhook or admin UI button to trigger the judge step so the entire pipeline is end-to-end in Airflow, not requiring manual Claude Code intervention.

## Monitoring

- **Airflow UI**: `http://localhost:8080` → DAGs → `bench_collect` → graph view
- **MLflow UI**: `http://localhost:5000/mlflow` → experiments → `tip-bench-*`
- **Admin API**: `GET /api/bench/leaderboard/tip-bench-auto` → JSON leaderboard

## Future: Admin UI Panel

`apps/admin/src/components/BenchPanel.tsx` (TBD):

- List experiments
- Trigger DAG with form (models, prompts, scenario count, temperature)
- Display current DAG run status
- Show leaderboard once `compare` completes
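
A trigger form like the one planned above would submit only a few fields, so the remaining keys have to come from the defaults in the Configuration Keys table. The sketch below shows one way that merge could look. It is not the DAG's actual code: `resolve_config` and `TOTAL_SCENARIOS` are hypothetical names, and clamping `n_scenarios` to the 8 available scenarios is an assumption about how the cap behaves.

```python
import os
from typing import Optional

# Defaults copied from the Configuration Keys table.
DEFAULTS = {
    "models": "qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b",
    "prompts": "v1,v2-mentor,v3-few-shot",
    "n_tips": 5,
    "n_scenarios": 0,
    "temperature": 0.7,
    "experiment": "tip-bench-auto",
    "max_model_b": 4.0,
    "ollama_url": "http://localhost:11434",
}

# From the table: n_scenarios = 0 means "all 8".
TOTAL_SCENARIOS = 8


def _split_csv(value: str) -> list:
    """Split a comma-separated string, dropping blanks and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def resolve_config(conf: Optional[dict] = None) -> dict:
    """Merge user-supplied keys over the documented defaults (hypothetical helper)."""
    merged = {**DEFAULTS, **(conf or {})}

    # mlflow_url falls back to $MLFLOW_TRACKING_URI, then to localhost.
    merged.setdefault(
        "mlflow_url",
        os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
    )

    # models and prompts arrive as comma-separated strings.
    merged["models"] = _split_csv(merged["models"])
    merged["prompts"] = _split_csv(merged["prompts"])

    # Assumption: 0 selects every scenario, larger values are clamped.
    requested = int(merged["n_scenarios"])
    merged["n_scenarios"] = (
        TOTAL_SCENARIOS if requested == 0 else min(requested, TOTAL_SCENARIOS)
    )
    return merged
```

Calling `resolve_config({"models": "qwen2.5:0.5b", "n_scenarios": 2})` yields a one-model run over two scenarios with every other key at its default.
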