New DAG (`ml/pipelines/bench_dag.py`) with three linked tasks:

1. collect.py: generates candidates, logs to MLflow
2. export_for_judge: exports pending runs for Claude Code scoring
3. compare: generates leaderboard by (model, prompt) cell

Config via dag_run.conf supports all collect.py options (models, prompts, n_tips, n_scenarios, temperature, experiment name, max_model_b).

New admin API endpoints (`services/api/src/routes/bench.ts`):

- GET /api/bench/experiments: list tip-bench-* experiments
- POST /api/bench/run: trigger DAG with custom config
- GET /api/bench/runs/:experiment: list runs in experiment
- GET /api/bench/leaderboard/:experiment: leaderboard by (model, prompt)

All endpoints require admin auth. Human judge (Claude Code) scores are applied manually post-export; future enhancement: add webhook to DAG.

Admin UI can now trigger and monitor benchmarks from a dashboard panel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Airflow Integration — bench_collect DAG
The benchmark harness integrates with Airflow as a DAG (ml/pipelines/bench_dag.py)
triggered on-demand from the admin UI or the CLI.
DAG Structure
Three linked tasks:
- `collect`: `collect.py` generates candidates per (model × prompt × scenario) cell and logs MLflow runs with `judge_pending=true`. It rejects models >4B and uses `keep_alive=0` for RAM safety.
- `export_for_judge`: `judge_cli.py --export` pulls pending runs into a single JSON file for Claude Code to score per the rubric. It XCom-pushes the path so the next task can find it.
- `compare`: `compare.py` aggregates scores by (model, prompt) cell and generates the leaderboard ranked by composite score.
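
The size guard in `collect` could look roughly like the sketch below. It assumes the parameter count can be read from the Ollama tag suffix (`:0.5b`, `:3b`); how `collect.py` actually determines model size is not stated here, and `model_size_b` and `filter_models` are hypothetical names.

```python
import re
from typing import List, Optional, Tuple


def model_size_b(tag: str) -> Optional[float]:
    """Parse the size in billions from a tag such as 'qwen2.5:1.5b'.

    Assumption: the size is encoded after the colon. Returns None otherwise.
    """
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    return float(match.group(1)) if match else None


def filter_models(tags: List[str], max_model_b: float = 4.0) -> Tuple[List[str], List[str]]:
    """Split tags into (accepted, rejected) against the size cap."""
    accepted, rejected = [], []
    for tag in tags:
        size = model_size_b(tag)
        # Tags with no parseable size are rejected rather than guessed at.
        if size is not None and size <= max_model_b:
            accepted.append(tag)
        else:
            rejected.append(tag)
    return accepted, rejected
```

Rejecting tags with an unknown size is the conservative choice for a RAM-constrained host; a real implementation might query Ollama for model details instead.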
Triggering from the CLI
```bash
# Minimal: use all defaults
airflow dags trigger bench_collect

# Custom config: specify models, prompts, scenario count
airflow dags trigger bench_collect --conf '{
  "models": "qwen2.5:0.5b,qwen2.5:1.5b",
  "prompts": "v1,v2-mentor",
  "n_tips": 5,
  "n_scenarios": 2,
  "temperature": 0.7,
  "experiment": "tip-bench-custom"
}'
```
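
When the trigger is scripted rather than typed, the same command can be assembled from a Python dict so the JSON quoting is handled for you. This is a minimal sketch; `build_trigger_command` is a hypothetical helper, not part of the repo.

```python
import json
import shlex
from typing import Dict, List, Optional


def build_trigger_command(conf: Optional[Dict] = None, dag_id: str = "bench_collect") -> List[str]:
    """Return the argv list for `airflow dags trigger`, adding --conf only when given."""
    cmd = ["airflow", "dags", "trigger", dag_id]
    if conf:
        # json.dumps yields the JSON string that --conf expects.
        cmd += ["--conf", json.dumps(conf, sort_keys=True)]
    return cmd


cmd = build_trigger_command({"models": "qwen2.5:0.5b,qwen2.5:1.5b", "n_scenarios": 2})
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a host where the `airflow` CLI is available.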
Triggering from the Admin UI
The API exposes:
POST /api/bench/run { config object }
Admin UI → Benchmark panel → "Run Collection" button → form dialog fills config →
POST to /api/bench/run → DAG triggered.
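
For calling the endpoint outside the UI, a request could be built as in the sketch below. Several details are assumptions: that the request body is the config object itself, that admin auth is a bearer token, and the base URL; `build_run_request` is a hypothetical helper.

```python
import json
import urllib.request
from typing import Dict


def build_run_request(base_url: str, config: Dict, admin_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that triggers the benchmark DAG."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/bench/run",
        data=json.dumps(config).encode("utf-8"),  # assumed body shape: the bare config object
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {admin_token}",  # assumed auth scheme
        },
    )


req = build_run_request(
    "http://localhost:3000",  # hypothetical API base URL
    {"models": "qwen2.5:0.5b", "n_tips": 5},
    admin_token="example-token",
)
```

Sending it is then `urllib.request.urlopen(req)`; the response format is not specified here, so the sketch stops before parsing it.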
Configuration Keys
| Key | Type | Default | Description |
|---|---|---|---|
| `models` | str | `qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b` | comma-separated Ollama tags |
| `prompts` | str | `v1,v2-mentor,v3-few-shot` | comma-separated prompt versions |
| `n_tips` | int | `5` | candidates to generate per scenario |
| `n_scenarios` | int | `0` | cap scenario count (0 = all 8) |
| `temperature` | float | `0.7` | LLM generation temperature |
| `experiment` | str | `tip-bench-auto` | MLflow experiment name |
| `max_model_b` | float | `4.0` | reject models larger than this (in billions) |
| `ollama_url` | str | `http://localhost:11434` | Ollama endpoint |
| `mlflow_url` | str | `$MLFLOW_TRACKING_URI` or `http://localhost:5000` | MLflow tracking URI |
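
One way a task might merge a trigger's `dag_run.conf` with these defaults is sketched below. The defaults mirror the table; the merging, type coercion, and unknown-key check are assumptions about reasonable behavior, not a description of what `bench_dag.py` does, and `resolve_conf` and `split_csv` are hypothetical names.

```python
import os
from typing import Dict, List, Mapping, Optional

DEFAULTS = {
    "models": "qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b",
    "prompts": "v1,v2-mentor,v3-few-shot",
    "n_tips": 5,
    "n_scenarios": 0,  # 0 means "use all scenarios"
    "temperature": 0.7,
    "experiment": "tip-bench-auto",
    "max_model_b": 4.0,
    "ollama_url": "http://localhost:11434",
}


def resolve_conf(conf: Optional[Dict] = None, env: Optional[Mapping[str, str]] = None) -> Dict:
    """Overlay user-supplied keys on the defaults, coercing to each default's type."""
    env = os.environ if env is None else env
    merged = dict(DEFAULTS)
    merged["mlflow_url"] = env.get("MLFLOW_TRACKING_URI", "http://localhost:5000")
    conf = conf or {}
    unknown = set(conf) - set(merged)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    for key, value in conf.items():
        merged[key] = type(merged[key])(value)  # e.g. "3" -> 3 for n_tips
    return merged


def split_csv(value: str) -> List[str]:
    """Turn a comma-separated option such as `models` into a clean list."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

Failing on unknown keys catches typos such as `n_tip` at trigger time instead of silently running with defaults.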
Human-in-the-Loop Judge
After collect finishes, export_for_judge produces a JSON file with all pending
runs. The Claude Code session:
- Reads the file
- Scores each candidate per the rubric (relevance/actionability/tone 1–5)
- Runs `judge_cli.py --apply /path/to/file.json` to write scores back to MLflow
Then compare generates the leaderboard.
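
The aggregation step can be pictured with the sketch below. It assumes the composite score is the unweighted mean of the three rubric axes and that each scored run is a dict with `model`, `prompt`, and one key per axis; neither the weighting nor the record shape used by `compare.py` is given here, and `leaderboard` is a hypothetical function.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

RUBRIC = ("relevance", "actionability", "tone")  # each scored 1-5


def leaderboard(scored_runs: List[Dict]) -> List[Dict]:
    """Average per-run composites within each (model, prompt) cell, best first."""
    cells = defaultdict(list)
    for run in scored_runs:
        composite = mean(run[axis] for axis in RUBRIC)  # assumed: unweighted mean
        cells[(run["model"], run["prompt"])].append(composite)
    rows = [
        {"model": model, "prompt": prompt, "composite": round(mean(values), 3), "n": len(values)}
        for (model, prompt), values in cells.items()
    ]
    return sorted(rows, key=lambda row: row["composite"], reverse=True)


runs = [
    {"model": "gemma3:1b", "prompt": "v1", "relevance": 5, "actionability": 4, "tone": 3},
    {"model": "gemma3:1b", "prompt": "v1", "relevance": 3, "actionability": 3, "tone": 3},
    {"model": "llama3.2:3b", "prompt": "v1", "relevance": 5, "actionability": 5, "tone": 5},
]
board = leaderboard(runs)
```

Reporting `n` alongside the composite makes it visible when a cell's rank rests on very few judged candidates.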
Future enhancement: Add a webhook or admin UI button to trigger the judge step so the entire pipeline runs end-to-end in Airflow without manual Claude Code intervention.
Monitoring
- Airflow UI: `http://localhost:8080` → DAGs → `bench_collect` → graph view
- MLflow UI: `http://localhost:5000/mlflow` → experiments → `tip-bench-*`
- Admin API: `GET /api/bench/leaderboard/tip-bench-auto` → JSON leaderboard
Future: Admin UI Panel
apps/admin/src/components/BenchPanel.tsx (TBD):
- List experiments
- Trigger DAG with form (models, prompts, scenario count, temperature)
- Display current DAG run status
- Show leaderboard once `compare` completes