
# Airflow Integration — `bench_collect` DAG

The benchmark harness runs in Airflow as the `bench_collect` DAG (`ml/pipelines/bench_dag.py`).
It is triggered on demand from the admin UI or the CLI.

## DAG Structure

Three linked tasks:

1. **`collect`** — `collect.py` generates candidates per (model × prompt × scenario) cell,
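   and logs MLflow runs with `judge_pending=true`. It rejects models larger than 4B
   parameters (the `max_model_b` default) and uses `keep_alive=0` so Ollama unloads
   each model right after use, which keeps RAM usage bounded.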

2. **`export_for_judge`** — `judge_cli.py --export` pulls pending runs into a single
   JSON file for Claude Code to score against the rubric. The task pushes the file
   path to XCom so the next task can find it.

3. **`compare`** — `compare.py` aggregates scores by (model, prompt) cell and
   generates the leaderboard ranked by composite score.
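
To make the cell grid concrete, here is a minimal Python sketch of how the (model × prompt × scenario) cells could be enumerated. The function name `enumerate_cells` and the scenario IDs are hypothetical; `collect.py` may organize this differently.

```python
from itertools import product


def enumerate_cells(models, prompts, scenarios):
    """Return every (model, prompt, scenario) combination as a list of tuples."""
    return list(product(models, prompts, scenarios))


models = ["qwen2.5:0.5b", "qwen2.5:1.5b"]
prompts = ["v1", "v2-mentor"]
scenarios = ["s1", "s2"]  # hypothetical scenario IDs, for illustration only

cells = enumerate_cells(models, prompts, scenarios)
print(len(cells))  # 2 models x 2 prompts x 2 scenarios = 8 cells
```

With `n_tips` set to 5, a grid of this size would yield 8 × 5 = 40 candidates to judge, which is worth keeping in mind when choosing `n_scenarios`.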

## Triggering from the CLI

```bash
# Minimal: use all defaults
airflow dags trigger bench_collect

# Custom config: specify models, prompts, scenario count
airflow dags trigger bench_collect --conf '{
  "models": "qwen2.5:0.5b,qwen2.5:1.5b",
  "prompts": "v1,v2-mentor",
  "n_tips": 5,
  "n_scenarios": 2,
  "temperature": 0.7,
  "experiment": "tip-bench-custom"
}'
```
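
When triggering from a script rather than a shell, building the `--conf` payload with `json.dumps` avoids hand-quoting mistakes. The sketch below only assembles the command; `build_trigger_command` is a hypothetical helper, not part of the harness.

```python
import json
import shlex


def build_trigger_command(conf=None, dag_id="bench_collect"):
    """Build the argv list for `airflow dags trigger`, adding --conf only when given."""
    cmd = ["airflow", "dags", "trigger", dag_id]
    if conf:
        cmd += ["--conf", json.dumps(conf)]
    return cmd


conf = {
    "models": "qwen2.5:0.5b,qwen2.5:1.5b",
    "prompts": "v1,v2-mentor",
    "n_tips": 5,
    "n_scenarios": 2,
}
cmd = build_trigger_command(conf)
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
# To actually run it: subprocess.run(cmd, check=True)
```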

## Triggering from the Admin UI

The API exposes:

```
POST /api/bench/run  { config object }
```

In the admin UI, the Benchmark panel's "Run Collection" button opens a form dialog
for the config. Submitting the form POSTs it to `/api/bench/run`, which triggers the DAG.
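
The same endpoint can be called from a script. This is a minimal sketch that assumes the request body is the config object itself, sent as JSON; the base URL `http://localhost:8000` and the helper name `build_run_request` are placeholders, and the real API may also require authentication.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # assumption: replace with the real API host


def build_run_request(config, base_url=API_BASE):
    """Build (but do not send) a JSON POST request to /api/bench/run."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/bench/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_run_request({"models": "qwen2.5:0.5b", "n_tips": 5})
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```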

## Configuration Keys

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `models` | str | `qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b` | comma-separated Ollama tags |
| `prompts` | str | `v1,v2-mentor,v3-few-shot` | comma-separated prompt versions |
| `n_tips` | int | 5 | candidates to generate per scenario |
| `n_scenarios` | int | 0 | cap scenario count (0 = all 8) |
| `temperature` | float | 0.7 | LLM generation temperature |
| `experiment` | str | `tip-bench-auto` | MLflow experiment name |
| `max_model_b` | float | 4.0 | reject models larger than this (in billions of parameters) |
| `ollama_url` | str | `http://localhost:11434` | Ollama endpoint |
| `mlflow_url` | str | `$MLFLOW_TRACKING_URI` or `http://localhost:5000` | MLflow tracking URI |
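
As an illustration of how these keys fit together, the sketch below merges a trigger config over the defaults, splits the comma-separated fields, and applies the `max_model_b` gate. It assumes the model size can be read from the Ollama tag suffix (for example `:1.5b`); how `collect.py` actually determines size is not specified here, and all function names are hypothetical.

```python
import re

# Defaults copied from the table above (URL keys omitted for brevity).
DEFAULTS = {
    "models": "qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b",
    "prompts": "v1,v2-mentor,v3-few-shot",
    "n_tips": 5,
    "n_scenarios": 0,
    "temperature": 0.7,
    "experiment": "tip-bench-auto",
    "max_model_b": 4.0,
}


def split_csv(value):
    """Split a comma-separated string, dropping blanks and surrounding spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def model_size_b(tag):
    """Parse the size in billions from a tag suffix like ':1.5b'; None if absent."""
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    return float(match.group(1)) if match else None


def resolve_config(conf=None):
    """Merge conf over DEFAULTS and reject models above max_model_b."""
    cfg = {**DEFAULTS, **(conf or {})}
    cfg["models"] = split_csv(cfg["models"])
    cfg["prompts"] = split_csv(cfg["prompts"])
    # Tags without a parsable size pass through in this sketch;
    # a stricter gate might reject them instead.
    too_big = [t for t in cfg["models"] if (model_size_b(t) or 0) > cfg["max_model_b"]]
    if too_big:
        raise ValueError(f"models exceed {cfg['max_model_b']}B: {too_big}")
    return cfg


cfg = resolve_config({"models": "qwen2.5:0.5b, gemma3:1b", "n_scenarios": 2})
print(cfg["models"], cfg["n_tips"])
```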

## Human-in-the-Loop Judge

After `collect` finishes, `export_for_judge` produces a JSON file with all pending
runs. A Claude Code session then:

1. Reads the file
2. Scores each candidate against the rubric (relevance, actionability, and tone, each 1–5)
3. Runs `judge_cli.py --apply /path/to/file.json` to write scores back to MLflow

Then `compare` generates the leaderboard.
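
Before applying scores, it can help to sanity-check them against the rubric. The sketch below validates the three dimensions and computes a composite as their unweighted mean. Both the integer-only scoring and the mean are assumptions made for illustration; the composite formula used by `compare.py` is not documented here.

```python
RUBRIC = ("relevance", "actionability", "tone")


def validate_scores(scores):
    """Raise ValueError unless every rubric dimension is an integer from 1 to 5."""
    for dim in RUBRIC:
        value = scores.get(dim)
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
    return scores


def composite(scores):
    """Unweighted mean of the three dimensions (an assumption, not the project's formula)."""
    validate_scores(scores)
    return sum(scores[dim] for dim in RUBRIC) / len(RUBRIC)


print(composite({"relevance": 4, "actionability": 5, "tone": 3}))  # 4.0
```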

**Future enhancement:** Add a webhook or admin UI button that triggers the judge step,
so the whole pipeline runs end-to-end in Airflow without manual Claude Code
intervention.

## Monitoring

- **Airflow UI**: `http://localhost:8080` → DAGs → `bench_collect` → graph view
- **MLflow UI**: `http://localhost:5000/mlflow` → experiments → `tip-bench-*`
- **Admin API**: `GET /api/bench/leaderboard/tip-bench-auto` → JSON leaderboard

## Future: Admin UI Panel

`apps/admin/src/components/BenchPanel.tsx` (TBD):
- List experiments
- Trigger DAG with form (models, prompts, scenario count, temperature)
- Display current DAG run status
- Show leaderboard once `compare` completes