feat: automated prompt optimization loop — sim A/B → promote winner #95
Goal
Automate the prompt improvement cycle via offline sim A/B testing and MLflow.
✅ COMPLETED
Implemented the lazy-judge pattern for prompt A/B evaluation, combined with model benchmarking in issue #93.
How it works
Results
tip-bench-2026-04-27
Airflow integration
bench_collect orchestrates collect → export → compare
Admin API
See commits 556019b (bench harness) and 0474ad4 (Airflow integration) for full implementation.
Idea: Claude Code as a lazy judge (no Opus API spend)
Instead of (or alongside) the Haiku auto-judge, organize MLflow runs so the current Claude Code session can play judge on demand:
Schema
tip-quality-v1, prompt-ab/v1-vs-v2).
prompt, context, candidate_outputs[] as artifacts
judge_pending=true, judge_kind=claude-code|haiku
rubric=tip-v1)
Flow
judge_pending=true.
.claude/commands/judge-mlflow.md (or a small ml/experiments/judge_cli.py):
judge_pending=true AND judge_kind=claude-code
ml/experiments/rubrics/tip-v1.md)
relevance, actionability, tone, format_ok) via mlflow.log_metric, sets judge_pending=false, stamps judged_at and judged_by=claude-code-session.
Why this works
judge_kind slicing just work.
Tradeoff
auto_score - human_score > Δ).
Net effect: same MLflow surface, two judge tiers (auto + lazy human-in-loop), promotion criteria can require both to agree.
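A minimal sketch of the lazy-judge flow described above, not the committed implementation. The tag names, metric names, and `judged_by` value come from this issue. The function names, the `tip-v1` default, and the string encoding of `judge_pending` are assumptions made for illustration. The `client` argument is expected to behave like `mlflow.tracking.MlflowClient`, whose `search_runs`, `log_metric`, and `set_tag` methods are the only calls used.

```python
from datetime import datetime, timezone

# Metric names as listed in the issue; the value range is left to the rubric.
RUBRIC_METRICS = ("relevance", "actionability", "tone", "format_ok")

# MLflow tags are strings, so "true"/"false" are compared as text (assumption:
# the runs were tagged with lowercase values).
PENDING_FILTER = "tags.judge_pending = 'true' AND tags.judge_kind = 'claude-code'"


def pending_tags(judge_kind: str, rubric: str = "tip-v1") -> dict[str, str]:
    """Tags a candidate run would carry until a judge scores it (hypothetical helper)."""
    return {"judge_pending": "true", "judge_kind": judge_kind, "rubric": rubric}


def find_pending_runs(client, experiment_ids: list[str]):
    """Return the runs still waiting for the claude-code judge."""
    return client.search_runs(experiment_ids, filter_string=PENDING_FILTER)


def record_verdict(client, run_id: str, scores: dict[str, float]) -> None:
    """Log one score per rubric metric, then mark the run as judged."""
    missing = set(RUBRIC_METRICS) - set(scores)
    if missing:
        # Refuse partial verdicts so a run is never marked judged with gaps.
        raise ValueError(f"missing rubric scores: {sorted(missing)}")
    for name in RUBRIC_METRICS:
        client.log_metric(run_id, name, float(scores[name]))
    client.set_tag(run_id, "judge_pending", "false")
    client.set_tag(run_id, "judged_at", datetime.now(timezone.utc).isoformat())
    client.set_tag(run_id, "judged_by", "claude-code-session")
```

With MLflow installed, passing `MlflowClient()` as `client` should be enough to try this against a tracking server. Taking the client as a parameter keeps the scoring logic testable without a server.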