# `bench/` — combined model + prompt evaluation harness
Combines the work of issues **#93** (model benchmark) and **#95** (prompt
A/B) into one MLflow-tracked experiment. Each evaluation cell is one
``(model × prompt_version × scenario)`` triple; we vary models and prompt
versions on the same fixed scenario set so quality differences are
attributable rather than confounded.
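
As a minimal sketch of how such a grid could be enumerated (the `Cell` type, the `build_grid` helper, and the scenario ids are hypothetical names used for illustration, not taken from `collect.py`):

```python
from itertools import product
from typing import NamedTuple


class Cell(NamedTuple):
    """One evaluation cell: a (model, prompt_version, scenario) triple."""
    model: str
    prompt_version: str
    scenario: str


def build_grid(models, prompt_versions, scenarios):
    """Return every combination in a stable, reproducible order."""
    return [Cell(m, p, s) for m, p, s in product(models, prompt_versions, scenarios)]


models = ["qwen2.5:0.5b", "gemma3:1b"]
prompts = ["v1", "v2-mentor"]
scenarios = ["scenario-a", "scenario-b", "scenario-c"]  # hypothetical scenario ids

grid = build_grid(models, prompts, scenarios)
print(len(grid))  # 2 models * 2 prompts * 3 scenarios = 12 cells
```

Because every model and prompt version sees the same scenario list, any two cells that differ in only one coordinate can be compared directly.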
## Pieces
| File | Purpose |
|------|---------|
| `rubric.md` | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic ``(persona × time-slot × tasks)`` contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local ``--allowed-hosts`` quirk and the file-only artifact backend. |
| `collect.py` | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py` | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py` | **Phase C.** Leaderboard per ``(model, prompt)`` cell. |
## RAM safety (#93 hard requirement)
* Models larger than 4B parameters are **rejected up front** by `collect.py --max-model-b 4.0`.
* Calls to Ollama include ``keep_alive=0``, which unloads the model from
  VRAM as soon as the response returns, so two sets of LLM weights are
  never held concurrently.
* No mock or embedded judge holds weights either: the human judge is the
  Claude Code session, which costs no local RAM.

The pipeline can run end to end on a box with 15 GiB RAM and 8 GiB VRAM
(1070-class GPU) without paging.
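
The two guards above could look roughly like the following sketch. It assumes the parameter count can be read from the Ollama tag suffix (for example `:3b`), which may not be how `collect.py` decides; `model_size_b`, `check_model`, and `generate_payload` are hypothetical helper names. The payload fields match Ollama's `/api/generate` request body.

```python
import re

MAX_MODEL_B = 4.0  # mirrors the --max-model-b 4.0 flag


def model_size_b(tag):
    """Parse tags such as 'llama3.2:3b'; return None when no size suffix is present."""
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    return float(match.group(1)) if match else None


def check_model(tag, max_b=MAX_MODEL_B):
    """Raise ValueError before any weights are loaded if the model is too large."""
    size = model_size_b(tag)
    if size is None:
        raise ValueError(f"cannot infer parameter count from {tag!r}")
    if size > max_b:
        raise ValueError(f"{tag!r} is {size}B, above the {max_b}B limit")
    return size


def generate_payload(model, prompt):
    """Build a request body for Ollama's /api/generate endpoint."""
    check_model(model)
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,  # unload the model as soon as the response returns
    }


print(generate_payload("llama3.2:3b", "Suggest one tip."))
```

Rejecting oversized models while building the payload means a bad `--models` entry fails before any generation starts.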
## Quick start
```bash
# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
--models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
--prompts v1,v2-mentor,v3-few-shot \
--experiment tip-bench-2026-04-27 \
--n-tips 5 \
--diversity
# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
--experiment tip-bench-2026-04-27 \
--export /tmp/oo-bench-judge.json
# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)
# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
--experiment tip-bench-2026-04-27 \
--apply /tmp/oo-bench-judge.json
# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
```
## Why the rubric matters
Different judging sessions need to be comparable. `rubric.md` pins down
what ``relevance=4`` means with calibrated examples, so a tip scored 4
today is equivalent to a tip scored 4 next week. Without the rubric, the
"lazy human-in-the-loop" judge drifts.
## Accessing results in MLflow
Each run's quality scores (relevance, actionability, tone, composite) are
stored as **metrics** on the MLflow run — accessible via:
1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk

Candidate tips, prompts, and raw responses are stored as **tags** with
keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt`
(a tag fallback, because the MLflow server uses a file:// artifact backend
that is not accessible via REST from the host).
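
For ad-hoc analysis, the run objects returned by MLflow's REST `runs/search` endpoint can be flattened and aggregated along these lines. This is a sketch under assumptions, not the code in `compare.py`: the param keys `model` and `prompt_version`, the helper names, and the idea that the `artifact:candidates.json` tag holds JSON text are all guesses for illustration. The sample data is synthetic.

```python
import json
from collections import defaultdict
from statistics import mean


def flatten_run(run):
    """Convert MLflow's key/value lists into plain dicts."""
    data = run.get("data", {})
    return {
        "metrics": {m["key"]: m["value"] for m in data.get("metrics", [])},
        "params": {p["key"]: p["value"] for p in data.get("params", [])},
        "tags": {t["key"]: t["value"] for t in data.get("tags", [])},
    }


def leaderboard(runs):
    """Mean composite score per (model, prompt) pair, best first."""
    buckets = defaultdict(list)
    for run in map(flatten_run, runs):
        if "composite" not in run["metrics"]:
            continue  # run has not been judged yet
        key = (run["params"].get("model"), run["params"].get("prompt_version"))
        buckets[key].append(run["metrics"]["composite"])
    rows = [(key, mean(scores)) for key, scores in buckets.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def make_run(model, prompt_version, composite):
    """Build a synthetic run in the shape of a runs/search response entry."""
    return {"data": {
        "metrics": [{"key": "composite", "value": composite}],
        "params": [{"key": "model", "value": model},
                   {"key": "prompt_version", "value": prompt_version}],
        "tags": [{"key": "artifact:candidates.json", "value": json.dumps(["tip"])}],
    }}


sample_runs = [
    make_run("gemma3:1b", "v1", 3.0),
    make_run("gemma3:1b", "v1", 4.0),
    make_run("llama3.2:3b", "v2-mentor", 4.5),
]
board = leaderboard(sample_runs)
for (model, prompt), score in board:
    print(model, prompt, score)

# Tag-stored artifacts decode with json.loads under the JSON-text assumption.
candidates = json.loads(flatten_run(sample_runs[0])["tags"]["artifact:candidates.json"])
```

Runs without a `composite` metric are skipped, so a partially judged experiment still produces a leaderboard over the scored cells.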
## Running standalone
The pipeline runs on any machine with:
- Ollama serving models of at most 4B parameters
- A reachable MLflow tracking server
- Python 3.10+