bench/ — combined model + prompt evaluation harness
Combines the work of issues #93 (model benchmark) and #95 (prompt
A/B) into one MLflow-tracked experiment. Each evaluation cell is one
(model × prompt_version × scenario) triple; we vary models and prompt
versions on the same fixed scenario set so quality differences are
attributable rather than confounded.
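A minimal sketch of how such a grid could be enumerated, assuming each cell is the full cross product of models, prompt versions, and a fixed scenario list. The `Cell` class, `build_grid` helper, and scenario IDs are hypothetical names used for illustration, not taken from `collect.py`.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """One evaluation cell: a (model, prompt_version, scenario) triple."""
    model: str
    prompt_version: str
    scenario_id: str


def build_grid(models, prompt_versions, scenario_ids):
    """Cross every model with every prompt version over the same scenarios."""
    return [
        Cell(m, p, s)
        for m, p, s in product(models, prompt_versions, scenario_ids)
    ]


grid = build_grid(
    models=["qwen2.5:0.5b", "gemma3:1b"],
    prompt_versions=["v1", "v2-mentor"],
    scenario_ids=["s01", "s02", "s03"],  # hypothetical scenario IDs
)
print(len(grid))  # 2 models x 2 prompts x 3 scenarios = 12 cells
```

Because every (model, prompt) pair sees the identical scenario list, a score difference between two pairs cannot be explained by one of them drawing easier inputs.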
Pieces
| File | Purpose |
|---|---|
| `rubric.md` | The scoring rubric (tip-v1). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic (persona × time-slot × tasks) contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local `--allowed-hosts` quirk and the file-only artifact backend. |
| `collect.py` | Phase A. Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py` | Phase B. `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py` | Phase C. Leaderboard per (model, prompt) cell. |
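The Phase B round trip can be sketched roughly as follows. This is an illustration under assumptions: the entry layout (`run_id`, `candidates`, `scores`) and the helper names `export_pending` and `collect_scored` are hypothetical, and the real `judge_cli.py` file format may differ. The `composite` metric is left out because its formula is not given here.

```python
import json

SCORE_KEYS = ("relevance", "actionability", "tone")  # metric names used in this README


def export_pending(runs):
    """Build the judge file: one entry per run still tagged judge_pending=true."""
    return [
        {
            "run_id": r["run_id"],
            "candidates": r["candidates"],
            "scores": {k: None for k in SCORE_KEYS},  # blanks for the judge to fill
        }
        for r in runs
        if r["tags"].get("judge_pending") == "true"
    ]


def collect_scored(entries):
    """Return {run_id: scores} for entries whose scores are all filled in."""
    scored = {}
    for entry in entries:
        scores = entry["scores"]
        if all(scores.get(k) is not None for k in SCORE_KEYS):
            scored[entry["run_id"]] = {k: float(scores[k]) for k in SCORE_KEYS}
    return scored


# Hypothetical in-memory runs standing in for what MLflow would return.
runs = [
    {"run_id": "a1", "candidates": ["Take a short walk."], "tags": {"judge_pending": "true"}},
    {"run_id": "b2", "candidates": ["Batch your email."], "tags": {"judge_pending": "false"}},
]
entries = json.loads(json.dumps(export_pending(runs)))  # simulate the file round trip
entries[0]["scores"] = {"relevance": 4, "actionability": 3, "tone": 5}  # judge edits the file
scored = collect_scored(entries)
```

Skipping entries with blank scores means a partially judged file can be applied safely and finished in a later session.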
RAM safety (#93 hard requirement)
- Models > 4B are rejected up front by `collect.py --max-model-b 4.0`.
- Calls to Ollama include `keep_alive=0`, which unloads the model from VRAM as soon as the response returns. We never hold two LLM weights concurrently.
- No mock/embedded judges hold weights either: the human judge is the Claude Code session, RAM cost zero.
The pipeline can run end to end without paging on a box with 15 GiB of RAM and 8 GiB of VRAM (1070-class GPU).
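The two guards above might look roughly like this. The sketch assumes the parameter count can be read from the size suffix of the Ollama tag (for example `:3b`), which may not be how `collect.py` actually determines model size. The helper names are hypothetical. The request body targets Ollama's `/api/generate` endpoint, where `keep_alive: 0` asks the server to unload the model once the response is returned.

```python
import re

MAX_MODEL_B = 4.0  # mirrors --max-model-b 4.0


def param_billions(model_tag):
    """Parse the size suffix of an Ollama tag, e.g. 'llama3.2:3b' -> 3.0."""
    match = re.search(r":(\d+(?:\.\d+)?)b\b", model_tag.lower())
    if match is None:
        raise ValueError(f"no parameter size in tag: {model_tag!r}")
    return float(match.group(1))


def check_model(model_tag, max_b=MAX_MODEL_B):
    """Reject models above the size cap before any weights are loaded."""
    size = param_billions(model_tag)
    if size > max_b:
        raise ValueError(f"{model_tag} is {size}B, above the {max_b}B cap")
    return size


def generate_payload(model_tag, prompt):
    """JSON body for POST /api/generate; keep_alive=0 unloads after the reply."""
    check_model(model_tag)
    return {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,
    }


payload = generate_payload("llama3.2:3b", "Suggest one tip for a busy morning.")
```

Running the size check inside the payload builder means an oversized model fails before any request is sent.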
Quick start
# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
--models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
--prompts v1,v2-mentor,v3-few-shot \
--experiment tip-bench-2026-04-27 \
--n-tips 5 \
--diversity
# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
--experiment tip-bench-2026-04-27 \
--export /tmp/oo-bench-judge.json
# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)
# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
--experiment tip-bench-2026-04-27 \
--apply /tmp/oo-bench-judge.json
# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
Why the rubric matters
Different judging sessions need to be comparable. rubric.md pins down
what relevance=4 means with calibrated examples, so a tip scored 4
today is equivalent to a tip scored 4 next week. Without the rubric, the
"lazy human-in-the-loop" judge drifts.
Accessing results in MLflow
Each run's quality scores (relevance, actionability, tone, composite) are stored as metrics on the MLflow run, accessible via:
- MLflow UI: experiment `tip-bench-2026-04-27` → click any run → Metrics section
- Leaderboard: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
- Raw API: `mlflow_client.search_runs()` filters and pulls metrics in bulk
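As a rough illustration of the raw-API route, the sketch below builds a body for MLflow's `POST /api/2.0/mlflow/runs/search` endpoint and turns a response into a per-cell leaderboard. It assumes that `model` and `prompt_version` are logged as run params and that ranking uses the mean `composite` metric. Both are assumptions, as are the helper names; `compare.py` may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def search_body(experiment_id, max_results=1000):
    """JSON body for POST /api/2.0/mlflow/runs/search."""
    return {"experiment_ids": [experiment_id], "max_results": max_results}


def flatten(pairs):
    """MLflow returns metrics/params/tags as [{'key': ..., 'value': ...}] lists."""
    return {p["key"]: p["value"] for p in pairs}


def leaderboard(response):
    """Mean composite per (model, prompt_version) cell, best first."""
    by_cell = defaultdict(list)
    for run in response.get("runs", []):
        data = run["data"]
        metrics = flatten(data.get("metrics", []))
        params = flatten(data.get("params", []))
        if "composite" in metrics:  # skip runs that have not been judged yet
            cell = (params["model"], params["prompt_version"])
            by_cell[cell].append(metrics["composite"])
    rows = [(cell, mean(values)) for cell, values in by_cell.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def _run(model, prompt, composite=None):
    """Build a fake run in the REST response shape (test data only)."""
    metrics = [] if composite is None else [{"key": "composite", "value": composite}]
    params = [{"key": "model", "value": model}, {"key": "prompt_version", "value": prompt}]
    return {"info": {"run_id": "x"}, "data": {"metrics": metrics, "params": params}}


sample = {"runs": [
    _run("gemma3:1b", "v1", 4.0),
    _run("gemma3:1b", "v1", 3.0),
    _run("qwen2.5:0.5b", "v1", 2.0),
    _run("qwen2.5:0.5b", "v2-mentor"),  # still pending, no composite
]}
board = leaderboard(sample)
```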
Candidate tips, prompts, and raw responses are stored as tags with
keys artifact:candidates.json, artifact:prompt.txt, artifact:raw.txt
(tag fallback because the MLflow server uses a file:// artifact backend
not accessible via REST from the host).
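The tag fallback can be sketched as a pair of helpers: one builds request bodies for MLflow's `POST /api/2.0/mlflow/runs/set-tag` endpoint, the other recovers the stored texts from a run's tag list. The helper names are hypothetical and the sketch is not the actual `mlflow_client.py` code. MLflow servers cap the size of tag values, so long raw responses may need truncating. The exact limit depends on the server version and is not handled here.

```python
import json

ARTIFACT_PREFIX = "artifact:"


def set_tag_bodies(run_id, artifacts):
    """One JSON body per artifact for POST /api/2.0/mlflow/runs/set-tag."""
    return [
        {"run_id": run_id, "key": ARTIFACT_PREFIX + name, "value": text}
        for name, text in artifacts.items()
    ]


def read_artifacts(tags):
    """Recover {filename: text} from a run's [{'key', 'value'}] tag list."""
    return {
        tag["key"][len(ARTIFACT_PREFIX):]: tag["value"]
        for tag in tags
        if tag["key"].startswith(ARTIFACT_PREFIX)
    }


bodies = set_tag_bodies("a1", {
    "candidates.json": json.dumps(["Take a short walk."]),
    "prompt.txt": "You are a helpful coach.",  # hypothetical prompt text
})

# Simulate the tag list a later search would return for this run.
tags = [{"key": b["key"], "value": b["value"]} for b in bodies]
tags.append({"key": "judge_pending", "value": "true"})
restored = read_artifacts(tags)
candidates = json.loads(restored["candidates.json"])
```

Keeping the filename in the tag key lets a reader tell stored payloads apart from ordinary tags such as `judge_pending`.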
Running standalone
The pipeline runs on any machine with:
- Ollama models ≤4B
- MLflow tracking server
- Python 3.10+