Files
oO/ml/experiments/bench/README.md
alvis 556019b060 feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)
Combines model evaluation (#93) and prompt A/B testing (#95) into one
experiment. Evaluates all (model × prompt × scenario) cells on the same
fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell, logs to MLflow
  with judge_pending=true. Rejects models >4B, uses keep_alive=0 for
  RAM safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code
  to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone,
plus format_ok and overlong flags. Composite = rel + act + tone +
2×format_ok − overlong. Rubric is self-describing and persisted in every
run so judges use consistent criteria across sessions.

Artifacts (prompts, candidates, raw responses) stored as MLflow tags
because the server uses a file:// backend not accessible via REST. Full
artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75).

Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:48:59 +00:00

90 lines
3.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# `bench/` — combined model + prompt evaluation harness
Combines the work of issues **#93** (model benchmark) and **#95** (prompt
A/B) into one MLflow-tracked experiment. Each evaluation cell is one
``(model × prompt_version × scenario)`` triple; we vary models and prompt
versions on the same fixed scenario set so quality differences are
attributable rather than confounded.
## Pieces
| File | Purpose |
|------|---------|
| `rubric.md` | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic ``(persona × time-slot × tasks)`` contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local ``--allowed-hosts`` quirk and the file-only artifact backend. |
| `collect.py` | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py` | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py` | **Phase C.** Leaderboard per ``(model, prompt)`` cell. |
## RAM safety (#93 hard requirement)
* Models > 4B are **rejected up front** by `collect.py --max-model-b 4.0`.
* Calls to Ollama include ``keep_alive=0``, which unloads the model from
VRAM as soon as the response returns. We never hold two LLM weights
concurrently.
* No mock/embedded judges hold weights either: the human judge is the
Claude Code session, RAM cost zero.
The pipeline can run on a 15 GiB / 8 GiB-VRAM box (1070-class GPU) end
to end without paging.
## Quick start
```bash
# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
--models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
--prompts v1,v2-mentor,v3-few-shot \
--experiment tip-bench-2026-04-27 \
--n-tips 5 \
--diversity
# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
--experiment tip-bench-2026-04-27 \
--export /tmp/oo-bench-judge.json
# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)
# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
--experiment tip-bench-2026-04-27 \
--apply /tmp/oo-bench-judge.json
# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
```
## Why the rubric matters
Different judging sessions need to be comparable. `rubric.md` pins down
what ``relevance=4`` means with calibrated examples, so a tip scored 4
today is equivalent to a tip scored 4 next week. Without the rubric, the
"lazy human-in-the-loop" judge drifts.
## Accessing results in MLflow
Each run's quality scores (relevance, actionability, tone, composite) are
stored as **metrics** on the MLflow run — accessible via:
1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk
Candidate tips, prompts, and raw responses are stored as **tags** with
keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt`
(tag fallback because the MLflow server uses a file:// artifact backend
not accessible via REST from the host).
## Integrating with Airflow (#95)
A future DAG `ml/pipelines/prompt_ab_eval.py` will wrap `collect.py`
exactly as shown in the quick-start, triggered on-demand from the admin
UI or manually. The results feed into the admin leaderboard view.
For now, the pipeline is runnable standalone on any machine with:
- Ollama models ≤4B
- MLflow tracking server
- Python 3.10+