alvis 0474ad4deb feat(airflow): integrate bench harness into bench_collect DAG

New DAG (`ml/pipelines/bench_dag.py`) with three linked tasks:
1. collect.py — generates candidates, logs to MLflow
2. export_for_judge — exports pending runs for Claude Code scoring
3. compare — generates leaderboard by (model, prompt) cell

Config via dag_run.conf supports all collect.py options (models, prompts,
n_tips, n_scenarios, temperature, experiment name, max_model_b).

New admin API endpoints (`services/api/src/routes/bench.ts`):
- GET /api/bench/experiments — list tip-bench-* experiments
- POST /api/bench/run — trigger DAG with custom config
- GET /api/bench/runs/:experiment — list runs in experiment
- GET /api/bench/leaderboard/:experiment — leaderboard by (model, prompt)

All endpoints require admin auth. Human judge (Claude Code) scores are
applied manually post-export; future enhancement: add webhook to DAG.

Admin UI can now trigger and monitor benchmarks from a dashboard panel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-27 11:54:30 +00:00

Airflow Integration — `bench_collect` DAG

The benchmark harness runs in Airflow as a DAG (`ml/pipelines/bench_dag.py`) that is triggered on demand from the admin UI or the CLI.

DAG Structure

Three linked tasks:

1. collect: collect.py generates candidates per (model × prompt × scenario) cell and logs MLflow runs with judge_pending=true. It rejects models larger than max_model_b (4B parameters by default) and sets keep_alive=0 so that Ollama unloads each model after use, which keeps RAM usage down.
2. export_for_judge: judge_cli.py --export pulls pending runs into a single JSON file for Claude Code to score per the rubric. The task pushes the file path to XCom so the next task can find it.
3. compare: compare.py aggregates scores by (model, prompt) cell and generates the leaderboard, ranked by composite score.
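The aggregation in the compare step can be pictured with a minimal sketch. It is not the code in compare.py: the record layout (`model`, `prompt`, one integer per rubric criterion) and the definition of the composite score as the plain mean of the three rubric scores are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Rubric criteria named in this document; equal weighting is an assumption.
RUBRIC = ("relevance", "actionability", "tone")


def leaderboard(runs):
    """Group scored runs by (model, prompt) and rank cells by mean composite."""
    cells = defaultdict(list)
    for run in runs:
        # Runs still waiting for the judge carry no scores, so skip them.
        if run.get("judge_pending", False):
            continue
        composite = mean(run[criterion] for criterion in RUBRIC)
        cells[(run["model"], run["prompt"])].append(composite)

    rows = [
        {"model": model, "prompt": prompt, "n": len(values),
         "composite": round(mean(values), 3)}
        for (model, prompt), values in cells.items()
    ]
    # Highest composite score first.
    return sorted(rows, key=lambda row: row["composite"], reverse=True)


# Hypothetical scored runs, not real benchmark data.
example_runs = [
    {"model": "a", "prompt": "v1", "relevance": 4, "actionability": 4, "tone": 4},
    {"model": "a", "prompt": "v1", "relevance": 2, "actionability": 2, "tone": 2},
    {"model": "b", "prompt": "v1", "relevance": 5, "actionability": 5, "tone": 5},
    {"model": "b", "prompt": "v2", "judge_pending": True},
]
board = leaderboard(example_runs)
```

Each leaderboard row corresponds to one (model, prompt) cell, so several candidates from the same cell contribute to a single averaged entry.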

Triggering from the CLI

# Minimal: use all defaults
airflow dags trigger bench_collect

# Custom config: specify models, prompts, scenario count
airflow dags trigger bench_collect --conf '{
  "models": "qwen2.5:0.5b,qwen2.5:1.5b",
  "prompts": "v1,v2-mentor",
  "n_tips": 5,
  "n_scenarios": 2,
  "temperature": 0.7,
  "experiment": "tip-bench-custom"
}'
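When the trigger comes from a script rather than a shell, quoting the JSON by hand is error-prone. The sketch below builds the same `airflow dags trigger` command as an argument list; the helper name `build_trigger_command` is hypothetical, and the command is only printed here, not executed.

```python
import json
import shlex


def build_trigger_command(conf=None, dag_id="bench_collect"):
    """Return the argv for `airflow dags trigger`, with optional --conf JSON."""
    argv = ["airflow", "dags", "trigger", dag_id]
    if conf:
        # json.dumps produces the JSON string that --conf expects.
        argv += ["--conf", json.dumps(conf)]
    return argv


conf = {
    "models": "qwen2.5:0.5b,qwen2.5:1.5b",
    "prompts": "v1,v2-mentor",
    "n_tips": 5,
    "n_scenarios": 2,
    "temperature": 0.7,
    "experiment": "tip-bench-custom",
}
argv = build_trigger_command(conf)

# Shell-safe rendering of the command, useful for logging.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would run the trigger without going through a shell, so no extra quoting is needed.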

Triggering from the Admin UI

The API exposes:

POST /api/bench/run  { config object }

Admin UI → Benchmark panel → "Run Collection" button → form dialog fills config → POST to /api/bench/run → DAG triggered.
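For callers outside the admin UI, the request could look roughly like the sketch below. Several details are assumptions and not confirmed by this document: the base URL `http://localhost:3000`, the bearer-token header used for admin auth, and that the request body is the config object itself. The request is built but not sent.

```python
import json
import urllib.request


def build_run_request(base_url, token, config):
    """Build (but do not send) a POST to /api/bench/run with a JSON config."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/bench/run",
        data=json.dumps(config).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; the real admin auth mechanism may differ.
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical base URL and token.
req = build_run_request(
    "http://localhost:3000",
    "ADMIN_TOKEN",
    {"models": "qwen2.5:0.5b", "n_tips": 5},
)
```

Calling `urllib.request.urlopen(req)` would send the request; it is left out so the sketch has no side effects.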

Configuration Keys

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `models` | str | `qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b` | comma-separated Ollama tags |
| `prompts` | str | `v1,v2-mentor,v3-few-shot` | comma-separated prompt versions |
| `n_tips` | int | 5 | candidates to generate per scenario |
| `n_scenarios` | int | 0 | cap scenario count (0 = all 8) |
| `temperature` | float | 0.7 | LLM generation temperature |
| `experiment` | str | `tip-bench-auto` | MLflow experiment name |
| `max_model_b` | float | 4.0 | reject models larger than this (in billions of parameters) |
| `ollama_url` | str | `http://localhost:11434` | Ollama endpoint |
| `mlflow_url` | str | `$MLFLOW_TRACKING_URI` or `http://localhost:5000` | MLflow tracking URI |
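One way to picture how these keys combine with a `dag_run.conf` payload is the sketch below. It is not the logic in bench_dag.py or collect.py: the helper names are hypothetical, and reading the parameter count from the Ollama tag suffix (for example `:0.5b`) is an assumption about how the `max_model_b` check might work.

```python
import os
import re

# Defaults copied from the configuration table.
DEFAULTS = {
    "models": "qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b",
    "prompts": "v1,v2-mentor,v3-few-shot",
    "n_tips": 5,
    "n_scenarios": 0,
    "temperature": 0.7,
    "experiment": "tip-bench-auto",
    "max_model_b": 4.0,
    "ollama_url": "http://localhost:11434",
    "mlflow_url": os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
}


def resolve_config(conf=None):
    """Merge overrides onto DEFAULTS, coercing each value to the default's type."""
    conf = conf or {}
    unknown = set(conf) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    merged = dict(DEFAULTS)
    for key, value in conf.items():
        merged[key] = type(DEFAULTS[key])(value)
    return merged


def model_size_b(tag):
    """Parse the size in billions from a tag like 'qwen2.5:0.5b', else None."""
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    return float(match.group(1)) if match else None


def allowed_models(cfg):
    """Keep only tags whose parsed size is known and within max_model_b."""
    kept = []
    for tag in cfg["models"].split(","):
        tag = tag.strip()
        size = model_size_b(tag)
        if size is not None and size <= cfg["max_model_b"]:
            kept.append(tag)
    return kept


cfg = resolve_config({"n_tips": "3", "models": "qwen2.5:0.5b,llama3:70b"})
print(allowed_models(cfg))
```

In this sketch a tag with no size suffix is rejected, which errs on the side of RAM safety; the real harness may resolve sizes differently.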

Human-in-the-Loop Judge

After collect finishes, export_for_judge produces a JSON file with all pending runs. The Claude Code session:

1. Reads the file
2. Scores each candidate per the rubric (relevance/actionability/tone 1–5)
3. Runs judge_cli.py --apply /path/to/file.json to write scores back to MLflow

Then compare generates the leaderboard.
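Because scores are entered by hand, a sanity check before the apply step can catch out-of-range or missing values. The sketch below assumes a hypothetical layout for the scored file (a list of records with `run_id` and a `scores` mapping) and integer scores; the actual export format of judge_cli.py is not documented here.

```python
RUBRIC = ("relevance", "actionability", "tone")


def validate_scores(runs):
    """Return a list of problems; an empty list means all scores are valid."""
    problems = []
    for run in runs:
        run_id = run.get("run_id", "<missing run_id>")
        scores = run.get("scores", {})
        for criterion in RUBRIC:
            value = scores.get(criterion)
            # bool is a subclass of int, so exclude it explicitly.
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not 1 <= value <= 5:
                problems.append(
                    f"{run_id}: {criterion}={value!r} is not an integer in 1..5"
                )
    return problems


# Hypothetical records: r1 is complete, r2 has one bad and one missing score.
scored = [
    {"run_id": "r1", "scores": {"relevance": 5, "actionability": 3, "tone": 1}},
    {"run_id": "r2", "scores": {"relevance": 6, "tone": 4}},
]
problems = validate_scores(scored)
```

Running a check like this first means a typo in one score is reported before anything is written back to MLflow.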

Future enhancement: add a webhook or admin UI button that triggers the judge step, so the entire pipeline runs end-to-end in Airflow without manual Claude Code intervention.

Monitoring

- Airflow UI: http://localhost:8080 → DAGs → bench_collect → graph view
- MLflow UI: http://localhost:5000/mlflow → experiments → tip-bench-*
- Admin API: GET /api/bench/leaderboard/tip-bench-auto → JSON leaderboard

Future: Admin UI Panel

apps/admin/src/components/BenchPanel.tsx (TBD):

- List experiments
- Trigger DAG with form (models, prompts, scenario count, temperature)
- Display current DAG run status
- Show leaderboard once compare completes
