Combines model evaluation (#93) and prompt A/B testing (#95) into one experiment. Evaluates all (model × prompt × scenario) cells on the same fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell, logs to MLflow with judge_pending=true. Rejects models >4B, uses keep_alive=0 for RAM safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone, plus format_ok and overlong flags. Composite = rel + act + tone + 2×format_ok − overlong. Rubric is self-describing and persisted in every run so judges use consistent criteria across sessions.

Artifacts (prompts, candidates, raw responses) stored as MLflow tags because the server uses a file:// backend not accessible via REST. Full artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75). Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
89
ml/experiments/bench/README.md
Normal file
@@ -0,0 +1,89 @@
# `bench/`: combined model + prompt evaluation harness

Combines the work of issues **#93** (model benchmark) and **#95** (prompt
A/B) into one MLflow-tracked experiment. Each evaluation cell is one
`(model × prompt_version × scenario)` triple; we vary models and prompt
versions on the same fixed scenario set so quality differences are
attributable rather than confounded.

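The grid described above can be sketched as a plain Cartesian product. This is a minimal illustration, not the code in `collect.py`: the model and prompt names come from the end-to-end test in the commit message, while the scenario names and the `build_cells` helper are placeholders invented for the example.

```python
from itertools import product

# Models and prompt versions as listed in the commit message.
MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
PROMPTS = ["v1", "v2-mentor", "v3-few-shot"]
# Placeholder names; the real contexts live in scenarios.py.
SCENARIOS = ["scenario-1", "scenario-2", "scenario-3", "scenario-4"]


def build_cells(models, prompts, scenarios):
    """Return every (model, prompt_version, scenario) triple exactly once."""
    return list(product(models, prompts, scenarios))


cells = build_cells(MODELS, PROMPTS, SCENARIOS)
print(len(cells))  # 4 * 3 * 4 = 48
```

Because every model and prompt version sees the same scenario list, any two cells that differ only in model (or only in prompt) are directly comparable.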
## Pieces

| File | Purpose |
|------|---------|
| `rubric.md` | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic `(persona × time-slot × tasks)` contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local `--allowed-hosts` quirk and the file-only artifact backend. |
| `collect.py` | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py` | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py` | **Phase C.** Leaderboard per `(model, prompt)` cell. |

## RAM safety (#93 hard requirement)

* Models > 4B are **rejected up front** by `collect.py --max-model-b 4.0`.
* Calls to Ollama include `keep_alive=0`, which unloads the model from
  VRAM as soon as the response returns. We never hold two sets of LLM
  weights concurrently.
* No mock/embedded judges hold weights either: the human judge is the
  Claude Code session, so its RAM cost is zero.

The pipeline can run end to end on a box with 15 GiB RAM and 8 GiB VRAM
(1070-class GPU) without paging.

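The two guards above might look roughly like the sketch below. It assumes the parameter count can be read from the Ollama tag suffix (for example `:1.5b`); `collect.py` may determine model size differently. The helper names are hypothetical. The payload fields (`model`, `prompt`, `stream`, `keep_alive`) are those of Ollama's `POST /api/generate` endpoint.

```python
import re

MAX_MODEL_B = 4.0  # mirrors the --max-model-b 4.0 value shown above


def parse_size_b(model_tag: str) -> float:
    """Read the size in billions from a tag such as 'qwen2.5:1.5b'.

    Assumption: the size is encoded right after the colon.
    """
    match = re.search(r":(\d+(?:\.\d+)?)b\b", model_tag.lower())
    if match is None:
        raise ValueError(f"cannot infer size from tag: {model_tag!r}")
    return float(match.group(1))


def check_models(tags, max_b=MAX_MODEL_B):
    """Raise before any generation starts if a model exceeds the limit."""
    too_big = [t for t in tags if parse_size_b(t) > max_b]
    if too_big:
        raise ValueError(f"models over {max_b}B rejected: {too_big}")
    return list(tags)


def generate_payload(model: str, prompt: str) -> dict:
    """JSON body for POST /api/generate; keep_alive=0 asks Ollama to
    unload the model as soon as the response has been returned."""
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}
```

Checking the whole model list before the first request means an oversized model fails the run immediately instead of partway through the grid.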
## Quick start

```bash
# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
    --prompts v1,v2-mentor,v3-few-shot \
    --experiment tip-bench-2026-04-27 \
    --n-tips 5 \
    --diversity

# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --export /tmp/oo-bench-judge.json

# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)

# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --apply /tmp/oo-bench-judge.json

# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
```

## Why the rubric matters

Different judging sessions need to be comparable. `rubric.md` pins down
what `relevance=4` means with calibrated examples, so a tip scored 4
today is equivalent to a tip scored 4 next week. Without the rubric, the
"lazy human-in-the-loop" judge drifts.

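The commit message for this change gives the `tip-v1` composite as `rel + act + tone + 2×format_ok − overlong`. A minimal sketch of that formula follows; the function name and the range check are assumptions for illustration, not the harness's own code.

```python
def composite(relevance: int, actionability: int, tone: int,
              format_ok: bool, overlong: bool) -> int:
    """Composite = rel + act + tone + 2*format_ok - overlong.

    The three scales run from 1 to 5; the two flags are booleans.
    """
    for name, value in (("relevance", relevance),
                        ("actionability", actionability),
                        ("tone", tone)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    return relevance + actionability + tone + 2 * int(format_ok) - int(overlong)


# Best and worst possible scores under this formula.
print(composite(5, 5, 5, format_ok=True, overlong=False))   # 17
print(composite(1, 1, 1, format_ok=False, overlong=True))   # 2
```

With these bounds a single tip scores between 2 and 17. A fractional leaderboard value such as 12.75 is therefore presumably an average over several runs.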
## Accessing results in MLflow

Each run's quality scores (relevance, actionability, tone, composite) are
stored as **metrics** on the MLflow run, accessible via:

1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk

Candidate tips, prompts, and raw responses are stored as **tags** with
keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt`
(tag fallback because the MLflow server uses a file:// artifact backend
not accessible via REST from the host).

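To show how metrics pulled in bulk could be turned into a per-cell leaderboard, here is a hedged sketch over the run shape returned by MLflow's REST `runs/search` endpoint (`data.metrics` and `data.tags` as lists of key/value objects). The tag keys `model` and `prompt_version`, the helper names, and the sample scores are all invented for the example. `compare.py` may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def flatten_run(run: dict) -> dict:
    """Turn MLflow's key/value lists into plain dicts."""
    data = run.get("data", {})
    return {
        "metrics": {m["key"]: m["value"] for m in data.get("metrics", [])},
        "tags": {t["key"]: t["value"] for t in data.get("tags", [])},
    }


def leaderboard(runs, metric="composite"):
    """Mean of `metric` per (model, prompt) cell, best first."""
    buckets = defaultdict(list)
    for run in map(flatten_run, runs):
        if metric not in run["metrics"]:
            continue  # run has not been judged yet
        # Hypothetical tag keys; the harness may name them differently.
        key = (run["tags"].get("model"), run["tags"].get("prompt_version"))
        buckets[key].append(run["metrics"][metric])
    rows = [(model, prompt, mean(vals), len(vals))
            for (model, prompt), vals in buckets.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def _sample_run(model, prompt, score=None):
    """Build a fake run payload; scores here are made up."""
    metrics = [] if score is None else [{"key": "composite", "value": score}]
    tags = [{"key": "model", "value": model},
            {"key": "prompt_version", "value": prompt}]
    return {"data": {"metrics": metrics, "tags": tags}}


sample = [
    _sample_run("model-a", "v1", 12.0),
    _sample_run("model-a", "v1", 14.0),
    _sample_run("model-b", "v1", 10.0),
    _sample_run("model-b", "v1"),  # still pending, ignored
]
board = leaderboard(sample)
for row in board:
    print(row)
```

Skipping runs without the metric keeps unjudged cells from dragging an average down.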
## Integrating with Airflow (#95)

A future DAG `ml/pipelines/prompt_ab_eval.py` will wrap `collect.py`
exactly as shown in the quick-start, triggered on demand from the admin
UI or manually. The results feed into the admin leaderboard view.

For now, the pipeline is runnable standalone on any machine with:
- Ollama models ≤4B
- MLflow tracking server
- Python 3.10+