# `bench/` — combined model + prompt evaluation harness

Combines the work of issues **#93** (model benchmark) and **#95** (prompt
A/B) into one MLflow-tracked experiment. Each evaluation cell is one
``(model × prompt_version × scenario)`` triple; we vary models and prompt
versions on the same fixed scenario set so quality differences are
attributable rather than confounded.
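
The grid expansion can be sketched as follows. This is a minimal illustration, not the code in `collect.py`: the `Cell` type, the `build_cells` helper, and the scenario IDs are hypothetical names chosen for the example.

```python
from itertools import product
from typing import NamedTuple


class Cell(NamedTuple):
    """One evaluation cell (hypothetical type, for illustration only)."""
    model: str
    prompt_version: str
    scenario: str


def build_cells(models, prompts, scenarios):
    """Expand the full (model x prompt_version x scenario) grid."""
    return [Cell(m, p, s) for m, p, s in product(models, prompts, scenarios)]


models = ["qwen2.5:0.5b", "gemma3:1b"]
prompts = ["v1", "v2-mentor"]
scenarios = ["scenario-a", "scenario-b", "scenario-c"]  # hypothetical IDs

cells = build_cells(models, prompts, scenarios)
print(len(cells))  # 2 models * 2 prompts * 3 scenarios = 12 cells
```

Because every `(model, prompt_version)` pair sees the same scenario list, a score difference between two pairs cannot be explained by one of them getting easier inputs.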

## Pieces

| File | Purpose |
|------|---------|
| `rubric.md`         | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
| `scenarios.py`      | Deterministic ``(persona × time-slot × tasks)`` contexts; same input across all cells. |
| `mlflow_client.py`  | Thin httpx-based MLflow REST wrapper. Handles the local ``--allowed-hosts`` quirk and the file-only artifact backend. |
| `collect.py`        | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py`      | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py`        | **Phase C.** Leaderboard per ``(model, prompt)`` cell. |

## RAM safety (#93 hard requirement)

* Models larger than 4B parameters are **rejected up front** by
  `collect.py --max-model-b 4.0`.
* Calls to Ollama include ``keep_alive=0``, which unloads the model from
  VRAM as soon as the response returns, so two sets of LLM weights are
  never held at the same time.
* No mock or embedded judge holds weights either: the human judge is the
  Claude Code session, which costs no local RAM.

The pipeline can run end to end, without paging, on a box with 15 GiB of
RAM and 8 GiB of VRAM (1070-class GPU).
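
A rough sketch of the two guards described in the bullets above. It assumes the parameter count can be read from the Ollama tag suffix (for example `:1.5b`); the helper names `model_size_b`, `check_models`, and `generate_payload` are hypothetical and may not match what `collect.py` actually does. The `keep_alive` field is part of the request body for Ollama's `/api/generate` endpoint.

```python
import re

MAX_MODEL_B = 4.0  # mirrors the --max-model-b 4.0 value shown above


def model_size_b(tag: str) -> float:
    """Parse the size in billions of parameters from a tag like 'qwen2.5:1.5b'."""
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    if match is None:
        raise ValueError(f"cannot infer model size from tag {tag!r}")
    return float(match.group(1))


def check_models(tags, max_b=MAX_MODEL_B):
    """Reject the whole run before any model is loaded if one tag is too large."""
    too_big = [t for t in tags if model_size_b(t) > max_b]
    if too_big:
        raise ValueError(f"models above {max_b}B are not allowed: {too_big}")
    return list(tags)


def generate_payload(model: str, prompt: str) -> dict:
    """Build a request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,  # unload the model as soon as the response returns
    }


allowed = check_models(["qwen2.5:0.5b", "gemma3:1b", "llama3.2:3b"])
payload = generate_payload(allowed[0], "Suggest one tip.")
```

Checking sizes before the first request means an oversized model fails the run immediately instead of partway through the grid.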

## Quick start

```bash
# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
    --prompts v1,v2-mentor,v3-few-shot \
    --experiment tip-bench-2026-04-27 \
    --n-tips 5 \
    --diversity

# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --export /tmp/oo-bench-judge.json

# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)

# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --apply /tmp/oo-bench-judge.json

# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
```

## Why the rubric matters

Different judging sessions need to be comparable. `rubric.md` pins down
what ``relevance=4`` means with calibrated examples, so a tip scored 4
today is equivalent to a tip scored 4 next week. Without the rubric, the
"lazy human-in-the-loop" judge drifts.

## Accessing results in MLflow

Each run's quality scores (relevance, actionability, tone, composite) are
stored as **metrics** on the MLflow run and can be reached in three ways:

1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk
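
For readers who pull metrics through the raw API, a per-cell leaderboard can be approximated as below. This is a sketch under assumptions: the flat run dictionaries, their key names, and the sample scores are made up for illustration, and `compare.py` may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Average the composite score per (model, prompt_version), best first."""
    by_cell = defaultdict(list)
    for run in runs:
        by_cell[(run["model"], run["prompt_version"])].append(run["composite"])
    rows = [
        (model, prompt, mean(scores), len(scores))
        for (model, prompt), scores in by_cell.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample data: one dict per judged run.
runs = [
    {"model": "gemma3:1b", "prompt_version": "v1", "composite": 2.0},
    {"model": "gemma3:1b", "prompt_version": "v1", "composite": 3.0},
    {"model": "llama3.2:3b", "prompt_version": "v2-mentor", "composite": 4.0},
    {"model": "llama3.2:3b", "prompt_version": "v2-mentor", "composite": 3.0},
]

for model, prompt, avg, n in leaderboard(runs):
    print(f"{model:<14} {prompt:<12} {avg:.2f} (n={n})")
```

Reporting the run count next to each mean makes it obvious when a cell is ranked on very few judged runs.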

Candidate tips, prompts, and raw responses are stored as **tags** with
keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt`.
Tags are a fallback: the MLflow server uses a `file://` artifact backend
that cannot be reached via REST from the host.
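
Reading those tags back could look like the sketch below. It assumes the run follows the MLflow REST shape, where `data.tags` is a list of `{"key", "value"}` objects, and that the `candidates.json` tag holds a JSON-encoded string; the `artifact_tags` helper and the sample run are hypothetical.

```python
import json

ARTIFACT_PREFIX = "artifact:"


def artifact_tags(run: dict) -> dict:
    """Return {filename: text} for every tag whose key starts with 'artifact:'."""
    tags = run.get("data", {}).get("tags", [])
    return {
        tag["key"][len(ARTIFACT_PREFIX):]: tag["value"]
        for tag in tags
        if tag["key"].startswith(ARTIFACT_PREFIX)
    }


# Made-up run payload in the MLflow REST layout.
run = {
    "info": {"run_id": "example-run"},
    "data": {
        "tags": [
            {"key": "judge_pending", "value": "true"},
            {"key": "artifact:prompt.txt", "value": "You are a mentor."},
            {"key": "artifact:candidates.json", "value": '["Tip one", "Tip two"]'},
        ]
    },
}

files = artifact_tags(run)
candidates = json.loads(files["candidates.json"])  # assumes JSON-encoded text
print(sorted(files), candidates)
```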

## Integrating with Airflow (#95)

A future DAG `ml/pipelines/prompt_ab_eval.py` will wrap `collect.py`
exactly as shown in the quick start, triggered on demand from the admin
UI or manually. The results will feed into the admin leaderboard view.

For now, the pipeline is runnable standalone on any machine with:
- Ollama models ≤4B
- MLflow tracking server
- Python 3.10+