oO/ml/experiments/bench/AIRFLOW.md
alvis 0474ad4deb feat(airflow): integrate bench harness into bench_collect DAG
New DAG (`ml/pipelines/bench_dag.py`) with three linked tasks:
1. collect.py — generates candidates, logs to MLflow
2. export_for_judge — exports pending runs for Claude Code scoring
3. compare — generates leaderboard by (model, prompt) cell

Config via dag_run.conf supports all collect.py options (models, prompts,
n_tips, n_scenarios, temperature, experiment name, max_model_b).

New admin API endpoints (`services/api/src/routes/bench.ts`):
- GET /api/bench/experiments — list tip-bench-* experiments
- POST /api/bench/run — trigger DAG with custom config
- GET /api/bench/runs/:experiment — list runs in experiment
- GET /api/bench/leaderboard/:experiment — leaderboard by (model, prompt)

All endpoints require admin auth. Human judge (Claude Code) scores are
applied manually post-export; future enhancement: add webhook to DAG.

Admin UI can now trigger and monitor benchmarks from a dashboard panel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:54:30 +00:00

# Airflow Integration — `bench_collect` DAG
The benchmark harness runs in Airflow as the `bench_collect` DAG (`ml/pipelines/bench_dag.py`),
which is triggered on demand from the admin UI or the CLI.
## DAG Structure
Three linked tasks:
1. **`collect`** — `collect.py` generates candidates per (model × prompt × scenario) cell,
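and logs each as an MLflow run with `judge_pending=true`. Rejects models larger than 4B
parameters (see `max_model_b`) and sets `keep_alive=0` so that Ollama unloads each model
right after use, which keeps RAM usage bounded.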
2. **`export_for_judge`** — `judge_cli.py --export` pulls pending runs into a single
JSON file for Claude Code to score per the rubric. XCom-pushes the path so the
next task can find it.
3. **`compare`** — `compare.py` aggregates scores by (model, prompt) cell and
generates the leaderboard ranked by composite score.
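
The size gate in `collect` can be sketched as follows. This is a minimal illustration, not the actual `collect.py` logic: it assumes the parameter count can be read from the trailing `<number>b` part of the Ollama tag, and the helper names `model_size_b` and `filter_models` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumption: the size is encoded as the trailing "<number>b" of the tag
# (e.g. "qwen2.5:0.5b"). The real collect.py may determine it differently.
_SIZE_RE = re.compile(r":(\d+(?:\.\d+)?)b$", re.IGNORECASE)


def model_size_b(tag: str) -> Optional[float]:
    """Return the size in billions parsed from an Ollama tag, or None."""
    match = _SIZE_RE.search(tag.strip())
    return float(match.group(1)) if match else None


def filter_models(models: str, max_model_b: float = 4.0) -> Tuple[List[str], List[str]]:
    """Split a comma-separated tag list into (accepted, rejected)."""
    accepted: List[str] = []
    rejected: List[str] = []
    for tag in (t.strip() for t in models.split(",") if t.strip()):
        size = model_size_b(tag)
        # Tags without a parseable size are rejected to stay on the safe side.
        if size is not None and size <= max_model_b:
            accepted.append(tag)
        else:
            rejected.append(tag)
    return accepted, rejected
```

Rejecting tags with no recognizable size is a conservative choice for a RAM-constrained host; a real implementation might instead query Ollama for model metadata.
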
## Triggering from the CLI
```bash
# Minimal: use all defaults
airflow dags trigger bench_collect
# Custom config: specify models, prompts, scenario count
airflow dags trigger bench_collect --conf '{
"models": "qwen2.5:0.5b,qwen2.5:1.5b",
"prompts": "v1,v2-mentor",
"n_tips": 5,
"n_scenarios": 2,
"temperature": 0.7,
"experiment": "tip-bench-custom"
}'
```
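
The same trigger can be issued programmatically. The sketch below builds the request for Airflow's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns` in Airflow 2.x). It assumes the webserver runs at `http://localhost:8080` with the basic-auth API backend enabled; the credentials are placeholders, and `build_trigger_request` is a hypothetical helper, not part of the repository.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, conf, username, password, dag_id="bench_collect"):
    """Build (but do not send) a POST that creates a DAG run with `conf`."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"conf": conf}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_trigger_request(
    "http://localhost:8080",  # assumed Airflow webserver URL
    {"models": "qwen2.5:0.5b,qwen2.5:1.5b", "prompts": "v1,v2-mentor", "n_scenarios": 2},
    username="admin",  # placeholder credentials
    password="admin",
)

# To actually trigger the run against a live Airflow instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["dag_run_id"])
```
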
## Triggering from the Admin UI
The API exposes:
```
POST /api/bench/run { config object }
```
Admin UI → Benchmark panel → "Run Collection" button → form dialog fills config →
POST to `/api/bench/run` → DAG triggered.
## Configuration Keys
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `models` | str | `qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b` | comma-separated Ollama tags |
| `prompts` | str | `v1,v2-mentor,v3-few-shot` | comma-separated prompt versions |
| `n_tips` | int | 5 | candidates to generate per scenario |
| `n_scenarios` | int | 0 | cap scenario count (0 = all 8) |
| `temperature` | float | 0.7 | LLM generation temperature |
| `experiment` | str | `tip-bench-auto` | MLflow experiment name |
| `max_model_b` | float | 4.0 | reject models larger than this (in billions) |
| `ollama_url` | str | `http://localhost:11434` | Ollama endpoint |
| `mlflow_url` | str | `$MLFLOW_TRACKING_URI` or `http://localhost:5000` | MLflow tracking URI |
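
As an illustration of how these defaults might be combined with a `dag_run.conf` payload, here is a minimal sketch. The merge and validation rules (rejecting unknown keys, checking types) are assumptions rather than the DAG's documented behaviour, and `resolve_config` is a hypothetical name.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "models": "qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b",
    "prompts": "v1,v2-mentor,v3-few-shot",
    "n_tips": 5,
    "n_scenarios": 0,
    "temperature": 0.7,
    "experiment": "tip-bench-auto",
    "max_model_b": 4.0,
    "ollama_url": "http://localhost:11434",
}


def resolve_config(conf=None, env=None):
    """Merge a dag_run.conf-style dict over the documented defaults."""
    env = os.environ if env is None else env
    resolved = dict(DEFAULTS)
    # mlflow_url falls back to $MLFLOW_TRACKING_URI, then to localhost.
    resolved["mlflow_url"] = env.get("MLFLOW_TRACKING_URI", "http://localhost:5000")
    for key, value in (conf or {}).items():
        if key not in resolved:
            raise KeyError(f"unknown config key: {key}")  # assumed policy
        expected = type(resolved[key])
        # Accept ints where a float is documented (e.g. temperature=1).
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected) or isinstance(value, bool):
            raise TypeError(f"{key} must be {expected.__name__}")
        resolved[key] = value
    return resolved
```

Calling `resolve_config({"n_tips": 3})` returns the full configuration with only `n_tips` overridden, which mirrors how a partial `--conf` payload leaves the remaining keys at their defaults.
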
## Human-in-the-Loop Judge
After `collect` finishes, `export_for_judge` produces a JSON file with all pending
runs. The Claude Code session:
1. Reads the file
2. Scores each candidate per the rubric (relevance, actionability, and tone, each on a 1-5 scale)
3. Runs `judge_cli.py --apply /path/to/file.json` to write scores back to MLflow

Once the scores are applied, `compare` generates the leaderboard.
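
The aggregation that `compare` performs might look roughly like the sketch below. It is not the actual `compare.py`: the record layout is hypothetical, each criterion is assumed to be an integer from 1 to 5, and the composite score is assumed to be the unweighted mean of the three criteria.

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ("relevance", "actionability", "tone")


def leaderboard(scored_runs):
    """Rank (model, prompt) cells by mean composite score, best first."""
    cells = defaultdict(list)
    for run in scored_runs:
        scores = [run[c] for c in CRITERIA]
        if not all(isinstance(s, int) and 1 <= s <= 5 for s in scores):
            raise ValueError(f"scores out of range: {run}")
        # Assumption: composite = unweighted mean of the three criteria.
        cells[(run["model"], run["prompt"])].append(mean(scores))
    rows = [
        {"model": m, "prompt": p, "n": len(v), "composite": round(mean(v), 3)}
        for (m, p), v in cells.items()
    ]
    return sorted(rows, key=lambda r: r["composite"], reverse=True)
```

Each row carries the number of scored candidates `n` alongside the composite, so cells backed by very few candidates can be read with appropriate caution.
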
**Future enhancement:** Add a webhook or admin UI button to trigger the judge step
so that the whole pipeline runs end-to-end in Airflow without manual Claude Code
intervention.
## Monitoring
- **Airflow UI**: `http://localhost:8080` → DAGs → `bench_collect` → graph view
- **MLflow UI**: `http://localhost:5000/mlflow` → experiments → `tip-bench-*`
- **Admin API**: `GET /api/bench/leaderboard/tip-bench-auto` → JSON leaderboard
## Future: Admin UI Panel
`apps/admin/src/components/BenchPanel.tsx` (TBD):
- List experiments
- Trigger DAG with form (models, prompts, scenario count, temperature)
- Display current DAG run status
- Show leaderboard once `compare` completes