# `bench/`: combined model + prompt evaluation harness

Combines the work of issues **#93** (model benchmark) and **#95** (prompt
A/B) into one MLflow-tracked experiment. Each evaluation cell is one
``(model × prompt_version × scenario)`` triple; we vary models and prompt
versions on the same fixed scenario set, so quality differences are
attributable to the model or prompt rather than confounded by differing
inputs.
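
As a rough illustration of the grid described above, the cells can be enumerated as a Cartesian product. This is only a sketch: the scenario IDs and the `evaluation_cells` helper are hypothetical, and the model and prompt names are borrowed from the quick-start flags; it does not claim to mirror how `collect.py` builds its grid.

```python
from itertools import product

# Hypothetical inputs for illustration; real scenarios come from scenarios.py.
models = ["qwen2.5:0.5b", "gemma3:1b"]
prompt_versions = ["v1", "v2-mentor"]
scenarios = ["scenario-a", "scenario-b", "scenario-c"]  # placeholder IDs


def evaluation_cells(models, prompt_versions, scenarios):
    """Return every (model, prompt_version, scenario) triple in the grid."""
    return list(product(models, prompt_versions, scenarios))


cells = evaluation_cells(models, prompt_versions, scenarios)

# Every (model, prompt) pair is paired with the same scenario list,
# which is what keeps the comparison unconfounded.
print(len(cells))  # 2 models x 2 prompts x 3 scenarios = 12
```

Because the scenario list is shared, any two cells that differ only in model (or only in prompt version) were evaluated on identical input.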

## Pieces

| File | Purpose |
|------|---------|
| `rubric.md` | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic ``(persona × time-slot × tasks)`` contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local ``--allowed-hosts`` quirk and the file-only artifact backend. |
| `collect.py` | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py` | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py` | **Phase C.** Leaderboard per ``(model, prompt)`` cell. |

## RAM safety (#93 hard requirement)

* Models larger than 4B parameters are **rejected up front** by
  `collect.py --max-model-b 4.0`.
* Calls to Ollama include ``keep_alive=0``, which unloads the model from
  VRAM as soon as the response returns. We never hold two models' weights
  concurrently.
* No mock/embedded judges hold weights either: the human judge is the
  Claude Code session, so its RAM cost is zero.

The pipeline can run end to end on a 15 GiB-RAM / 8 GiB-VRAM box
(1070-class GPU) without paging.
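
The two safeguards above can be sketched as follows. This is a minimal sketch under stated assumptions: the helpers `model_size_b`, `check_models`, and `generate_payload` are hypothetical names, and inferring the size from the tag suffix (for example `:3b`) is an assumption, not necessarily how `collect.py` decides. The payload fields follow Ollama's `/api/generate` request body, where `keep_alive: 0` asks the server to unload the model right after responding.

```python
import re


def model_size_b(tag):
    """Parse the size in billions of parameters from a tag like 'llama3.2:3b'.

    Assumption: the size is encoded after the colon, as in the quick-start tags.
    """
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    if match is None:
        raise ValueError(f"cannot infer size from model tag: {tag!r}")
    return float(match.group(1))


def check_models(tags, max_model_b=4.0):
    """Reject the whole run up front if any model exceeds the size cap."""
    too_big = [t for t in tags if model_size_b(t) > max_model_b]
    if too_big:
        raise ValueError(f"models over {max_model_b}B rejected: {too_big}")
    return list(tags)


def generate_payload(model, prompt):
    """Build a request body for Ollama's POST /api/generate.

    keep_alive=0 tells Ollama to unload the model once the reply is sent.
    """
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}


accepted = check_models(["qwen2.5:0.5b", "gemma3:1b", "llama3.2:3b"])
payload = generate_payload(accepted[0], "placeholder prompt")
```

Validating the full model list before generating anything means an oversized model fails the run immediately instead of partway through the grid.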

## Quick start

```bash
# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
    --prompts v1,v2-mentor,v3-few-shot \
    --experiment tip-bench-2026-04-27 \
    --n-tips 5 \
    --diversity

# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --export /tmp/oo-bench-judge.json

# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)

# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --apply /tmp/oo-bench-judge.json

# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
```

## Why the rubric matters

Different judging sessions need to be comparable. `rubric.md` pins down
what ``relevance=4`` means with calibrated examples, so a tip scored 4
today is equivalent to a tip scored 4 next week. Without the rubric, the
"lazy human-in-the-loop" judge drifts.

## Accessing results in MLflow

Each run's quality scores (relevance, actionability, tone, composite) are
stored as **metrics** on the MLflow run and can be read in three ways:

1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk
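
To make the per-cell leaderboard concrete, here is a minimal sketch of averaging the composite metric per ``(model, prompt)`` cell. The flat run dictionaries, the `leaderboard` helper, and the sample scores are all hypothetical placeholders; the aggregation actually used by `compare.py` may differ.

```python
from collections import defaultdict
from statistics import mean

# Placeholder runs with made-up scores, shaped as flat dicts for illustration.
runs = [
    {"model": "model-a", "prompt": "v1", "composite": 4.0},
    {"model": "model-a", "prompt": "v1", "composite": 3.0},
    {"model": "model-b", "prompt": "v1", "composite": 4.0},
]


def leaderboard(runs):
    """Rank (model, prompt) cells by mean composite score, best first."""
    by_cell = defaultdict(list)
    for run in runs:
        by_cell[(run["model"], run["prompt"])].append(run["composite"])
    rows = [
        (model, prompt, mean(scores), len(scores))
        for (model, prompt), scores in by_cell.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


for model, prompt, avg, n in leaderboard(runs):
    print(f"{model:10s} {prompt:6s} mean={avg:.2f} n={n}")
```

Reporting the run count next to the mean helps spot cells whose rank rests on very few scored scenarios.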

Candidate tips, prompts, and raw responses are stored as **tags** with
keys `artifact:candidates.json`, `artifact:prompt.txt`, and
`artifact:raw.txt`. Tags are used as a fallback because the MLflow server
uses a `file://` artifact backend that is not accessible via REST from
the host.
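
Reading those tag-stored artifacts back could look like the sketch below. It assumes the run is shaped like an MLflow REST run object, with tags as a list of `{"key", "value"}` pairs under `data.tags`; the `artifacts_from_tags` helper and the sample run are hypothetical, and treating `candidates.json` as JSON text is an assumption based on its name.

```python
import json

ARTIFACT_PREFIX = "artifact:"


def artifacts_from_tags(run):
    """Collect tag-stored artifacts from a run, keyed by file name."""
    tags = run.get("data", {}).get("tags", [])
    found = {}
    for tag in tags:
        key = tag["key"]
        if key.startswith(ARTIFACT_PREFIX):
            found[key[len(ARTIFACT_PREFIX):]] = tag["value"]
    return found


# Placeholder run for illustration only.
sample_run = {
    "data": {
        "tags": [
            {"key": "judge_pending", "value": "true"},
            {"key": "artifact:candidates.json", "value": '["tip one", "tip two"]'},
            {"key": "artifact:prompt.txt", "value": "placeholder prompt"},
        ]
    }
}

artifacts = artifacts_from_tags(sample_run)
# Assumption: the candidates tag holds JSON text.
candidates = json.loads(artifacts["candidates.json"])
```

Ordinary tags such as `judge_pending` are skipped, so only the `artifact:`-prefixed entries are returned.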

## Running standalone

The pipeline runs on any machine with:
- Ollama models ≤4B
- MLflow tracking server
- Python 3.10+