# `bench/`: combined model + prompt evaluation harness

Combines the work of issues **#93** (model benchmark) and **#95** (prompt A/B) into one MLflow-tracked experiment. Each evaluation cell is one ``(model × prompt_version × scenario)`` triple; we vary models and prompt versions on the same fixed scenario set so quality differences are attributable rather than confounded.

## Pieces

| File | Purpose |
|------|---------|
| `rubric.md` | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic ``(persona × time-slot × tasks)`` contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local ``--allowed-hosts`` quirk and the file-only artifact backend. |
| `collect.py` | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py` | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py` | **Phase C.** Leaderboard per ``(model, prompt)`` cell. |

## RAM safety (#93 hard requirement)

* Models > 4B are **rejected up front** by `collect.py --max-model-b 4.0`.
* Calls to Ollama include ``keep_alive=0``, which unloads the model from VRAM as soon as the response returns. We never hold two LLM weights concurrently.
* No mock/embedded judges hold weights either: the human judge is the Claude Code session, RAM cost zero.

The pipeline can run on a 15 GiB / 8 GiB-VRAM box (1070-class GPU) end to end without paging.

## Quick start

```bash
# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
    --prompts v1,v2-mentor,v3-few-shot \
    --experiment tip-bench-2026-04-27 \
    --n-tips 5 \
    --diversity

# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --export /tmp/oo-bench-judge.json

# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)

# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --apply /tmp/oo-bench-judge.json

# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
```

## Why the rubric matters

Different judging sessions need to be comparable. `rubric.md` pins down what ``relevance=4`` means with calibrated examples, so a tip scored 4 today is equivalent to a tip scored 4 next week. Without the rubric, the "lazy human-in-the-loop" judge drifts.

## Accessing results in MLflow

Each run's quality scores (relevance, actionability, tone, composite) are stored as **metrics** on the MLflow run, accessible via:

1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk

Candidate tips, prompts, and raw responses are stored as **tags** with keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt` (tag fallback because the MLflow server uses a file:// artifact backend not accessible via REST from the host).

## Integrating with Airflow (#95)

A future DAG `ml/pipelines/prompt_ab_eval.py` will wrap `collect.py` exactly as shown in the quick-start, triggered on-demand from the admin UI or manually. The results feed into the admin leaderboard view.

For now, the pipeline is runnable standalone on any machine with:

- Ollama models ≤4B
- MLflow tracking server
- Python 3.10+
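
The two RAM-safety rules described earlier (the ≤4B cap and `keep_alive=0`) could be sketched as a pre-flight check along the following lines. This is an illustration, not the code in `collect.py`: the helper names `model_size_b`, `check_models`, and `generate_payload` are hypothetical, and parsing the size from the tag suffix (`:1.5b`) is an assumption about how the cap might be enforced. The `keep_alive` field itself is a documented parameter of Ollama's `/api/generate` endpoint.

```python
import re
from typing import Optional

MAX_MODEL_B = 4.0  # mirrors the `--max-model-b 4.0` value used in this README


def model_size_b(tag: str) -> Optional[float]:
    """Parse the parameter count in billions from a tag such as 'qwen2.5:1.5b'.

    Hypothetical helper: assumes the size is encoded after the colon.
    Returns None when no size suffix is found (e.g. 'mistral:latest').
    """
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    return float(match.group(1)) if match else None


def check_models(tags, max_b: float = MAX_MODEL_B):
    """Reject any model whose size is unknown or above the cap, before any run starts."""
    rejected = []
    for tag in tags:
        size = model_size_b(tag)
        if size is None or size > max_b:
            rejected.append(tag)
    if rejected:
        raise ValueError(f"models exceed {max_b}B or have unknown size: {rejected}")
    return list(tags)


def generate_payload(model: str, prompt: str) -> dict:
    """Build a request body for Ollama's /api/generate.

    keep_alive=0 asks Ollama to unload the model right after the response,
    so two sets of weights are never resident at the same time.
    """
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}
```

Treating a tag with no parseable size as rejected is a deliberately conservative choice in this sketch: an unknown model is assumed unsafe for an 8 GiB-VRAM box until proven otherwise.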
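
To make the notion of an evaluation cell concrete, the grid of ``(model × prompt_version × scenario)`` triples can be pictured as a plain Cartesian product. The sketch below is only a mental model: `Cell`, `build_grid`, and the scenario IDs are hypothetical placeholders, and the real scenario contexts come from `scenarios.py`.

```python
from itertools import product
from typing import NamedTuple


class Cell(NamedTuple):
    """One evaluation cell: a single (model, prompt_version, scenario) triple."""
    model: str
    prompt_version: str
    scenario_id: str


def build_grid(models, prompts, scenario_ids):
    """Enumerate every cell; each (model, prompt) pair sees the same scenarios."""
    return [Cell(m, p, s) for m, p, s in product(models, prompts, scenario_ids)]


MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
PROMPTS = ["v1", "v2-mentor", "v3-few-shot"]
SCENARIOS = ["scenario-a", "scenario-b"]  # hypothetical IDs for illustration

grid = build_grid(MODELS, PROMPTS, SCENARIOS)
print(len(grid))  # 4 models * 3 prompts * 2 scenarios = 24 cells
```

Because the scenario list is shared by every ``(model, prompt)`` pair, a score difference between two pairs cannot be explained by one of them having seen easier inputs.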