oO/ml/experiments at c1f5fcb561fd9ae13e54fa3df893dfb62dfd1791 - oO

alvis/oO


alvis 0474ad4deb feat(airflow): integrate bench harness into bench_collect DAG

New DAG (`ml/pipelines/bench_dag.py`) with three linked tasks:
1. collect.py — generates candidates, logs to MLflow
2. export_for_judge — exports pending runs for Claude Code scoring
3. compare — generates leaderboard by (model, prompt) cell
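The grouping done by the compare step can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: the run record shape (`model`, `prompt`, `score`), the function name `build_leaderboard`, and ranking cells by mean score are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_leaderboard(runs):
    """Group scored runs by (model, prompt) cell and rank cells by mean score.

    `runs` is an iterable of dicts with the hypothetical keys
    "model", "prompt", and "score". Runs whose score is None
    (exported but not yet judged) are skipped.
    """
    cells = defaultdict(list)
    for run in runs:
        if run.get("score") is None:
            continue  # still pending a judge score
        cells[(run["model"], run["prompt"])].append(run["score"])

    rows = [
        {"model": model, "prompt": prompt, "n_runs": len(scores), "mean_score": mean(scores)}
        for (model, prompt), scores in cells.items()
    ]
    # Best-scoring cell first.
    rows.sort(key=lambda row: row["mean_score"], reverse=True)
    return rows


if __name__ == "__main__":
    example_runs = [
        {"model": "model-a", "prompt": "baseline", "score": 4},
        {"model": "model-a", "prompt": "baseline", "score": 2},
        {"model": "model-b", "prompt": "baseline", "score": 5},
        {"model": "model-b", "prompt": "detailed", "score": None},
    ]
    for row in build_leaderboard(example_runs):
        print(row)
```

Skipping unscored runs keeps pending exports from dragging a cell's mean down before the judge has scored them.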

Config via dag_run.conf supports all collect.py options (models, prompts,
n_tips, n_scenarios, temperature, experiment name, max_model_b).
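One way a task could resolve these options is to merge `dag_run.conf` over a set of defaults. The sketch below is an assumption about how that might look, not the code in `bench_dag.py`: the key name `experiment`, every default value, and the helper `resolve_config` are hypothetical.

```python
# Hypothetical defaults; the real values live in collect.py and are not shown here.
DEFAULTS = {
    "models": ["model-a"],
    "prompts": ["baseline"],
    "n_tips": 5,
    "n_scenarios": 3,
    "temperature": 0.7,
    "experiment": "tip-bench-default",
    "max_model_b": None,  # no model-size cap unless one is given
}


def resolve_config(conf):
    """Merge user-supplied options over DEFAULTS, rejecting unknown keys.

    Inside an Airflow task, `conf` would typically come from
    `context["dag_run"].conf`, which may be None or empty when the
    run is triggered without a config.
    """
    conf = conf or {}
    unknown = set(conf) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown bench options: {sorted(unknown)}")
    return {**DEFAULTS, **conf}


if __name__ == "__main__":
    print(resolve_config({"n_tips": 10, "temperature": 0.2}))
```

Rejecting unknown keys is a design choice in this sketch: a misspelled option fails the run early instead of silently falling back to a default.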

New admin API endpoints (`services/api/src/routes/bench.ts`):
- GET /api/bench/experiments — list tip-bench-* experiments
- POST /api/bench/run — trigger DAG with custom config
- GET /api/bench/runs/:experiment — list runs in experiment
- GET /api/bench/leaderboard/:experiment — leaderboard by (model, prompt)
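A client call to the run endpoint might look like the sketch below. Only the path `/api/bench/run` comes from the list above; the base URL, the bearer-token auth scheme, the JSON body shape, and the helper name `build_run_request` are assumptions made for illustration.

```python
import json
import urllib.request


def build_run_request(base_url, admin_token, config):
    """Build (but do not send) a POST request that triggers a bench run.

    Assumes the API accepts the run config as a JSON body and checks
    admin auth via a bearer token; both are guesses, not confirmed behavior.
    """
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/bench/run",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {admin_token}",
        },
    )


if __name__ == "__main__":
    req = build_run_request(
        "http://localhost:3000",  # hypothetical API address
        "example-token",          # hypothetical admin token
        {"n_tips": 3, "experiment": "tip-bench-smoke"},
    )
    print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request; building it separately keeps the payload easy to inspect first.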

All endpoints require admin auth. Scores from the human judge (Claude Code)
are applied manually after export; a possible future enhancement is to add a
webhook to the DAG so that scoring no longer needs a manual step.

Admin UI can now trigger and monitor benchmarks from a dashboard panel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-27 11:54:30 +00:00

| Name | Last commit | Date |
|---|---|---|
| bench | feat(airflow): integrate bench harness into bench_collect DAG | 2026-04-27 11:54:30 +00:00 |
| sim | feat(simulate): MLflow tracking, Airflow DAG integration, health checks for mlflow/airflow | 2026-04-26 12:08:36 +00:00 |
| .gitkeep | chore: scaffold oO monorepo with architecture, roadmap, and module stubs | 2026-04-13 14:19:56 +00:00 |