feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)

Combines model evaluation (#93) and prompt A/B testing (#95) into one
experiment. Evaluates all (model × prompt × scenario) cells on the same
fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell and logs each run
  to MLflow with judge_pending=true. Rejects models >4B and calls Ollama
  with keep_alive=0 so each model is unloaded right after its response
  (no two sets of model weights held in VRAM at once).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code
  to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.
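
A minimal sketch of the Phase A safety guard, not the code in collect.py.
It assumes the parameter count can be read from the size suffix of the
Ollama tag (e.g. `:1.5b`); the helper names `params_billion`,
`check_model` and `build_generate_payload` are hypothetical. The payload
keys are the ones Ollama's /api/generate endpoint accepts.

```python
import re

MAX_PARAMS_B = 4.0  # size ceiling stated in the commit message


def params_billion(model_tag: str) -> float:
    """Parse the size suffix of an Ollama-style tag, e.g. 'qwen2.5:1.5b' -> 1.5.

    Assumption: the size is encoded after the colon as '<number>b'.
    """
    match = re.fullmatch(r"[^:]+:(\d+(?:\.\d+)?)b", model_tag.lower())
    if match is None:
        raise ValueError(f"cannot infer parameter count from {model_tag!r}")
    return float(match.group(1))


def check_model(model_tag: str) -> float:
    """Reject any model above the 4B ceiling; return its size otherwise."""
    size = params_billion(model_tag)
    if size > MAX_PARAMS_B:
        raise ValueError(f"{model_tag} is {size}B, above the {MAX_PARAMS_B}B limit")
    return size


def build_generate_payload(model_tag: str, prompt: str) -> dict:
    """Build a request body for Ollama's /api/generate endpoint."""
    check_model(model_tag)
    return {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,
        # keep_alive=0 asks Ollama to unload the model right after responding.
        "keep_alive": 0,
    }
```

Checking the size before building the payload means an oversized model
is refused before any weights are loaded.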

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone,
plus format_ok and overlong flags. Composite = rel + act + tone +
2×format_ok − overlong. Rubric is self-describing and persisted in every
run so judges use consistent criteria across sessions.
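
The composite formula above can be written out as a small sketch. The
`TipScore` class and its field names are illustrative, not taken from the
harness; only the formula itself comes from the rubric description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TipScore:
    relevance: int      # 1-5
    actionability: int  # 1-5
    tone: int           # 1-5
    format_ok: bool
    overlong: bool

    def __post_init__(self) -> None:
        # Enforce the 1-5 range the rubric defines for the three scales.
        for name in ("relevance", "actionability", "tone"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    def composite(self) -> int:
        """rel + act + tone + 2*format_ok - overlong."""
        return (
            self.relevance
            + self.actionability
            + self.tone
            + 2 * int(self.format_ok)
            - int(self.overlong)
        )
```

Under this formula a single judged tip scores between 2 (all ones, bad
format, overlong) and 17 (all fives, good format, not overlong).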

Artifacts (prompts, candidates, raw responses) are stored as MLflow tags
because the server uses a file:// backend that is not accessible via
REST. The full text is viewable in MLflow UI → run → Tags section.
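
To illustrate how tag-based storage supports the Phase B round trip,
here is a sketch that models each run as a plain dict of string tags
(MLflow tag values are strings). It is not the code in judge_cli.py: the
tag names `candidate`, `rubric` and `judge_scores` and both function
names are hypothetical; only `judge_pending` appears in this commit.

```python
import json


def export_pending(runs: dict) -> str:
    """Return a JSON document listing every run still awaiting a judge."""
    pending = [
        {"run_id": run_id, "candidate": tags["candidate"], "rubric": tags["rubric"]}
        for run_id, tags in runs.items()
        if tags.get("judge_pending") == "true"
    ]
    return json.dumps(pending, indent=2)


def apply_scores(runs: dict, scored_json: str) -> int:
    """Write judge scores back as tags and clear the pending flag."""
    applied = 0
    for entry in json.loads(scored_json):
        tags = runs[entry["run_id"]]
        tags["judge_scores"] = json.dumps(entry["scores"], sort_keys=True)
        tags["judge_pending"] = "false"
        applied += 1
    return applied
```

Against a real tracking server the same steps would presumably go
through `MlflowClient.search_runs` and `MlflowClient.set_tag` rather
than dict access.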

Tested end-to-end on local machine:
- 4 models, all ≤4B (qwen2.5:0.5b, qwen2.5:1.5b, gemma3:1b, llama3.2:3b)
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked
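
The cell count follows from a full cross product of the three axes. A
short sketch, with placeholder scenario ids (the real scenario names are
not given in this commit) and a hypothetical `build_cells` helper:

```python
from itertools import product

MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
PROMPTS = ["v1", "v2-mentor", "v3-few-shot"]
# Placeholder ids; the actual scenario definitions live in the harness.
SCENARIOS = ["scenario-1", "scenario-2", "scenario-3", "scenario-4"]


def build_cells(models, prompts, scenarios):
    """Enumerate every (model, prompt, scenario) evaluation cell."""
    return list(product(models, prompts, scenarios))


cells = build_cells(MODELS, PROMPTS, SCENARIOS)
print(len(cells))  # 4 * 3 * 4 = 48
```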

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75).
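
A sketch of how a (model, prompt) leaderboard could be computed,
assuming each cell's score is the mean composite over its scenarios. The
commit does not state the aggregation, although a fractional value such
as 12.75 is consistent with averaging four integer composites (51 / 4).
The `leaderboard` function and the row layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(rows):
    """Rank (model, prompt) pairs by mean composite, best first.

    rows: iterable of (model, prompt, composite) tuples, one per judged run.
    """
    grouped = defaultdict(list)
    for model, prompt, composite in rows:
        grouped[(model, prompt)].append(composite)
    ranked = [
        (model, prompt, mean(scores))
        for (model, prompt), scores in grouped.items()
    ]
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked
```

For example, with made-up scores, `leaderboard([("model-a", "prompt-x",
13), ("model-a", "prompt-x", 12), ("model-b", "prompt-x", 10)])` ranks
model-a first with a mean of 12.5.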

Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:48:59 +00:00
parent e40dfdcbb0
commit 556019b060
8 changed files with 1147 additions and 0 deletions

@@ -0,0 +1,18 @@
"""oO tip-generation benchmark harness.

Combines model evaluation (#93) and prompt A/B testing (#95) into one
MLflow-tracked experiment. Each evaluation cell is one (model × prompt ×
scenario) triple; we vary models and prompts on the same fixed scenario
set so quality differences are attributable rather than confounded.

The pipeline follows the lazy-judge pattern: collect candidates with
deterministic metrics (latency, format_ok), export to a JSON file for
Claude Code to score per the rubric, apply scores back to MLflow, and
generate a leaderboard.

RAM safety is enforced: models >4B are rejected, Ollama calls use
keep_alive=0 to unload model weights from VRAM immediately, and the
judge (a Claude Code session) adds no local inference cost.

See README.md for usage.
"""