"""oO tip-generation benchmark harness.

Combines model evaluation (#93) and prompt A/B testing (#95) into one
MLflow-tracked experiment. Each evaluation cell is one (model × prompt ×
scenario) triple; we vary models and prompts over the same fixed scenario
set so quality differences are attributable to the model or prompt rather
than confounded by scenario differences.

The pipeline follows the lazy-judge pattern: collect candidates along
with automatically computed metrics (latency, format_ok), export them to
a JSON file for Claude Code to score per the rubric, apply the scores
back to MLflow, and generate a leaderboard.

RAM safety is enforced: models larger than 4B parameters are rejected,
and Ollama calls use keep_alive=0 so the model is unloaded from memory
immediately after each request. The judge (a Claude Code session) adds
no local inference cost.

See README.md for usage.
"""