"""oO tip-generation benchmark harness. Combines model evaluation (#93) and prompt A/B testing (#95) into one MLflow-tracked experiment. Each evaluation cell is one (model × prompt × scenario) triple; we vary models and prompts on the same fixed scenario set so quality differences are attributable rather than confounded. The pipeline follows the lazy-judge pattern: collect candidates with deterministic metrics (latency, format_ok), export to a JSON file for Claude Code to score per the rubric, apply scores back to MLflow, and generate a leaderboard. RAM safety is enforced: models >4B are rejected, Ollama calls use keep_alive=0 to unload VRAM immediately, and the human judge (Claude Code session) has zero inference cost. See README.md for usage. """