oO/ml/experiments/bench
alvis 556019b060 feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)
Combines model evaluation (#93) and prompt A/B testing (#95) into one
experiment. Evaluates all (model × prompt × scenario) cells on the same
fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell, logs to MLflow
  with judge_pending=true. Rejects models >4B, uses keep_alive=0 for
  RAM safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code
  to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone,
plus format_ok and overlong flags. Composite = rel + act + tone +
2×format_ok − overlong. Rubric is self-describing and persisted in every
run so judges use consistent criteria across sessions.

Artifacts (prompts, candidates, raw responses) stored as MLflow tags
because the server uses a file:// backend not accessible via REST. Full
artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models, all ≤4B (qwen2.5:0.5b, qwen2.5:1.5b, gemma3:1b, llama3.2:3b)
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (persona × time-slot contexts)
- 48 cells total, all judged and ranked

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75).

Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:48:59 +00:00

bench/ — combined model + prompt evaluation harness

Combines the work of issues #93 (model benchmark) and #95 (prompt A/B) into one MLflow-tracked experiment. Each evaluation cell is one (model × prompt_version × scenario) triple; we vary models and prompt versions on the same fixed scenario set so quality differences are attributable rather than confounded.
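
As a rough illustration of the cell grid described above, the sketch below enumerates every (model, prompt_version, scenario) triple with `itertools.product`. The model and prompt names come from the quick start; the scenario IDs and the `build_cells` helper are hypothetical placeholders, not names from the harness.

```python
from itertools import product

# Models and prompt versions as listed in the quick start.
MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
PROMPTS = ["v1", "v2-mentor", "v3-few-shot"]
# Hypothetical scenario IDs; the real contexts live in scenarios.py.
SCENARIOS = ["scenario-1", "scenario-2", "scenario-3", "scenario-4"]


def build_cells(models, prompts, scenarios):
    """Return every (model, prompt_version, scenario) triple in a stable order."""
    return list(product(models, prompts, scenarios))


cells = build_cells(MODELS, PROMPTS, SCENARIOS)
print(len(cells))  # 4 models * 3 prompts * 4 scenarios = 48
```

Because every model and prompt version sees the same scenario list, any two cells that differ only in model (or only in prompt) can be compared directly.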

Pieces

| File | Purpose |
|---|---|
| rubric.md | The scoring rubric (tip-v1). Anchor for the human judge across sessions. |
| scenarios.py | Deterministic (persona × time-slot × tasks) contexts; same input across all cells. |
| mlflow_client.py | Thin httpx-based MLflow REST wrapper. Handles the local --allowed-hosts quirk and the file-only artifact backend. |
| collect.py | Phase A. Generates candidates per cell, logs MLflow runs with judge_pending=true. |
| judge_cli.py | Phase B. --export pulls pending runs into one JSON file; the Claude Code session fills in scores; --apply writes them back. |
| compare.py | Phase C. Leaderboard per (model, prompt) cell. |
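
To make the Phase C step more concrete, here is a minimal sketch of a per-(model, prompt) leaderboard. It assumes the ranking key is the mean composite score across scenarios; the actual aggregation in compare.py may differ. The `leaderboard` function, the run dictionaries, and the sample scores are all hypothetical and are not real benchmark results.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Group runs by (model, prompt) and rank by mean composite, best first.

    Each run is a dict with 'model', 'prompt' and 'composite' keys
    (an assumed shape, chosen for illustration only).
    """
    groups = defaultdict(list)
    for run in runs:
        groups[(run["model"], run["prompt"])].append(run["composite"])
    rows = [
        (model, prompt, mean(scores), len(scores))
        for (model, prompt), scores in groups.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up scores, purely to exercise the function.
sample_runs = [
    {"model": "model-a", "prompt": "v1", "composite": 10},
    {"model": "model-a", "prompt": "v1", "composite": 12},
    {"model": "model-b", "prompt": "v1", "composite": 13},
    {"model": "model-b", "prompt": "v1", "composite": 14},
]
ranking = leaderboard(sample_runs)
for model, prompt, score, n in ranking:
    print(f"{model:10s} {prompt:12s} mean_composite={score:.2f} n={n}")
```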

RAM safety (#93 hard requirement)

  • Models > 4B are rejected up front by collect.py --max-model-b 4.0.
  • Calls to Ollama include keep_alive=0, which unloads the model from VRAM as soon as the response returns, so two models' weights are never resident at the same time.
  • No judge model is loaded locally either: scoring is done by the Claude Code session, so judging adds no local RAM cost.

The pipeline can run end to end on a machine with 15 GiB of RAM and 8 GiB of VRAM (1070-class GPU) without paging.
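
The two guards above can be sketched as follows. This is an illustration under assumptions, not the code in collect.py: it assumes the parameter count can be read from the model tag (for example `qwen2.5:1.5b`), and it treats tags with no parseable size as rejected. The helper names `model_size_b`, `check_models`, and `generate_payload` are hypothetical. The payload fields (`model`, `prompt`, `stream`, `keep_alive`) are those of Ollama's `/api/generate` endpoint; the sketch only builds the request body and sends nothing.

```python
import re


def model_size_b(tag):
    """Parse the size in billions of parameters from a tag like 'qwen2.5:1.5b'.

    Assumption: the size follows the colon. Returns None if it cannot be parsed.
    """
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    return float(match.group(1)) if match else None


def check_models(tags, max_model_b=4.0):
    """Raise ValueError if any model is over the limit or has an unknown size."""
    rejected = []
    for tag in tags:
        size = model_size_b(tag)
        if size is None or size > max_model_b:
            rejected.append(tag)
    if rejected:
        raise ValueError(f"over {max_model_b}B or unknown size: {rejected}")
    return list(tags)


def generate_payload(model, prompt):
    """Build a request body for Ollama's /api/generate endpoint.

    keep_alive=0 asks Ollama to unload the model right after responding.
    """
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}


accepted = check_models(["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"])
payload = generate_payload(accepted[0], "Suggest one tip for a busy morning.")
```

Running the size check before any generation call means an oversized model fails fast, before any weights are loaded.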

Quick start

# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
    --prompts v1,v2-mentor,v3-few-shot \
    --experiment tip-bench-2026-04-27 \
    --n-tips 5 \
    --diversity

# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --export /tmp/oo-bench-judge.json

# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)

# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --apply /tmp/oo-bench-judge.json

# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27

Why the rubric matters

Different judging sessions need to be comparable. rubric.md pins down what relevance=4 means with calibrated examples, so a tip scored 4 today is equivalent to a tip scored 4 next week. Without the rubric, the "lazy human-in-the-loop" judge drifts.
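
For reference, the tip-v1 composite is relevance + actionability + tone + 2 × format_ok − overlong, with the three quality axes on 1–5 scales and the last two terms as flags. A small sketch of that arithmetic follows; the function name and the range validation are illustrative additions and are not taken from rubric.md.

```python
def composite(relevance, actionability, tone, format_ok, overlong):
    """Compute the tip-v1 composite from three 1-5 scores and two flags."""
    for name, value in (
        ("relevance", relevance),
        ("actionability", actionability),
        ("tone", tone),
    ):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    # format_ok adds a bonus of 2; overlong subtracts a penalty of 1.
    return relevance + actionability + tone + 2 * int(bool(format_ok)) - int(bool(overlong))


best = composite(5, 5, 5, format_ok=True, overlong=False)   # 17, the maximum
worst = composite(1, 1, 1, format_ok=False, overlong=True)  # 2, the minimum
```

The composite therefore ranges from 2 to 17 for a single tip.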

Accessing results in MLflow

Each run's quality scores (relevance, actionability, tone, composite) are stored as metrics on the MLflow run and are accessible via:

  1. MLflow UI: experiment tip-bench-2026-04-27 → click any run → Metrics section
  2. Leaderboard: python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
  3. Raw API: mlflow_client.search_runs() filters and pulls metrics in bulk

Candidate tips, prompts, and raw responses are stored as run tags under the keys artifact:candidates.json, artifact:prompt.txt, and artifact:raw.txt. Tags are a fallback: the MLflow server uses a file:// artifact backend that is not reachable via REST from the host.
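
A sketch of the request bodies involved, assuming the standard MLflow REST endpoints `POST /api/2.0/mlflow/runs/search` and `POST /api/2.0/mlflow/runs/set-tag`. The helper names are hypothetical and do not reflect the actual interface of mlflow_client.py; the sketch only builds JSON bodies and performs no network calls.

```python
import json


def pending_runs_query(experiment_id, max_results=100):
    """Body for runs/search that selects runs still waiting for a judge."""
    return {
        "experiment_ids": [experiment_id],
        "filter": "tags.judge_pending = 'true'",
        "max_results": max_results,
    }


def artifact_tag_body(run_id, name, content):
    """Body for runs/set-tag that stores a text artifact under 'artifact:<name>'."""
    return {"run_id": run_id, "key": f"artifact:{name}", "value": content}


# Placeholder experiment and run IDs, for illustration only.
query = pending_runs_query("1")
tag = artifact_tag_body(
    "example-run-id",
    "candidates.json",
    json.dumps(["Tip one", "Tip two"]),
)
print(json.dumps(query, indent=2))
print(tag["key"])  # artifact:candidates.json
```

Storing the candidates as a JSON string keeps the tag value round-trippable with `json.loads` when the judge export is built.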

Integrating with Airflow (#95)

A planned DAG, ml/pipelines/prompt_ab_eval.py, will wrap the collect.py invocation shown in the quick start and be triggered on demand, either from the admin UI or manually. Its results will feed the admin leaderboard view.

For now, the pipeline runs standalone on any machine with:

  • Ollama serving models of ≤4B parameters
  • A reachable MLflow tracking server
  • Python 3.10+