feat: automated prompt optimization loop — sim A/B → promote winner #95

alvis · 2026-04-17T08:13:44Z

alvis commented

2026-04-17 08:13:44 +00:00

Goal

Automate the prompt improvement cycle via offline sim A/B testing and MLflow.

✅ COMPLETED

Implemented the lazy-judge pattern for prompt A/B evaluation, combined with model benchmarking in issue #93.

How it works

1. collect.py generates candidates per (model × prompt × scenario) cell
2. judge_cli.py --export pulls pending runs into JSON for Claude Code to score
3. Rubric (tip-v1) anchors scoring: relevance/actionability/tone (1–5), format_ok, overlong
4. Claude Code judge scores inline (zero API cost, human in the loop)
5. judge_cli.py --apply writes metrics back to MLflow
6. compare.py generates leaderboard by (model, prompt) cell
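
For a rough picture of the last step, here is a sketch of how judged runs could be rolled up into a leaderboard keyed by (model, prompt). The record layout, the definition of composite as the sum of the three 1-5 scores, and the sample numbers are all assumptions for illustration; this is not the actual compare.py logic or real benchmark data.

```python
from collections import defaultdict
from statistics import mean

SCORE_KEYS = ("relevance", "actionability", "tone")  # the 1-5 rubric axes


def composite(run):
    # Assumption: composite = sum of the three 1-5 scores (max 15).
    return sum(run[key] for key in SCORE_KEYS)


def leaderboard(runs):
    """Average composite per (model, prompt) cell, best cell first."""
    cells = defaultdict(list)
    for run in runs:
        cells[(run["model"], run["prompt"])].append(composite(run))
    rows = [
        {"model": model, "prompt": prompt,
         "composite_avg": mean(scores), "n": len(scores)}
        for (model, prompt), scores in cells.items()
    ]
    return sorted(rows, key=lambda row: row["composite_avg"], reverse=True)


# Made-up judged runs (one per scenario), not real results.
sample_runs = [
    {"model": "model-a", "prompt": "v1",
     "relevance": 3, "actionability": 3, "tone": 4},
    {"model": "model-a", "prompt": "v1",
     "relevance": 4, "actionability": 3, "tone": 4},
    {"model": "model-a", "prompt": "v3-few-shot",
     "relevance": 4, "actionability": 4, "tone": 5},
    {"model": "model-a", "prompt": "v3-few-shot",
     "relevance": 5, "actionability": 4, "tone": 4},
]

board = leaderboard(sample_runs)
```

Sorting by the cell average keeps the output directly comparable to the "composite avg" figure reported per prompt.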

Results

Evaluated 3 prompts (v1, v2-mentor, v3-few-shot) across 4 models
Top prompt: v3-few-shot (composite 12.75 avg)
All scores stored in MLflow experiment tip-bench-2026-04-27
Rubric persisted in every run's tags for consistency

Airflow integration

DAG bench_collect orchestrates collect → export → compare
Configurable via dag_run.conf
Triggered on-demand from Airflow UI or admin API
Human-in-the-loop judge (Claude Code) is a manual step after export
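
To make the dag_run.conf part concrete, here is a minimal sketch of how trigger-time overrides could be merged onto defaults inside a task. The key names (`models`, `prompts`, `scenarios_per_cell`) and their default values are hypothetical; the real DAG may use different ones.

```python
# Hypothetical defaults; the actual bench_collect keys may differ.
DEFAULT_CONF = {
    "models": ["model-a"],
    "prompts": ["v1"],
    "scenarios_per_cell": 5,
}


def resolve_conf(dag_run_conf):
    """Merge trigger-time overrides onto defaults, rejecting unknown keys."""
    overrides = dag_run_conf or {}  # conf may be empty when nothing is passed
    unknown = set(overrides) - set(DEFAULT_CONF)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULT_CONF, **overrides}


resolved = resolve_conf({"prompts": ["v1", "v3-few-shot"]})
```

Rejecting unknown keys up front means a typo in the trigger payload fails fast instead of silently running the default grid.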

Admin API

GET /api/bench/experiments, /api/bench/runs/:experiment, /api/bench/leaderboard/:experiment
POST /api/bench/run to trigger DAG with custom config
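
As an example of calling the trigger endpoint, the sketch below builds the POST request with the standard library only. The base URL, the `{"conf": ...}` payload shape, and the config keys are assumptions; check the admin API handler for the real contract.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: wherever the admin API is served


def build_trigger_request(conf, base_url=BASE_URL):
    """Build (but do not send) a POST /api/bench/run request."""
    body = json.dumps({"conf": conf}).encode("utf-8")  # payload shape is assumed
    return urllib.request.Request(
        url=f"{base_url}/api/bench/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_trigger_request({"prompts": ["v1", "v3-few-shot"]})
# To actually send it: urllib.request.urlopen(req)
```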

See commits 556019b (bench harness) and 0474ad4 (Airflow integration) for full implementation.

alvis added this to the M4 — MLOps at scale milestone 2026-04-17 08:13:44 +00:00

alvis commented

2026-04-26 14:25:59 +00:00

Idea: Claude Code as a lazy judge (no Opus API spend)

Instead of (or alongside) the Haiku auto-judge, organize MLflow runs so the current Claude Code session can play judge on demand:

Schema

One MLflow experiment per judge task (e.g. tip-quality-v1, prompt-ab/v1-vs-v2).
Each run = one model/prompt under test. Log:
- prompt, context, candidate_outputs[] as artifacts
- tag judge_pending=true, judge_kind=claude-code|haiku
- rubric version as a tag (rubric=tip-v1)
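
A minimal sketch of what logging one run under this schema might look like. The artifact file names and the `log_candidate_run` helper are hypothetical; the tag names come from the schema above, and MLflow is imported lazily so the tag helper can be used without it.

```python
JUDGE_KINDS = {"claude-code", "haiku"}


def pending_run_tags(judge_kind, rubric="tip-v1"):
    """Tags for a freshly created, not-yet-judged run (MLflow tag values are strings)."""
    if judge_kind not in JUDGE_KINDS:
        raise ValueError(f"unknown judge_kind: {judge_kind!r}")
    return {"judge_pending": "true", "judge_kind": judge_kind, "rubric": rubric}


def log_candidate_run(experiment, prompt, context, candidate_outputs, judge_kind):
    """Hypothetical helper: create one MLflow run for a model/prompt under test."""
    import mlflow  # lazy import; only needed when actually logging

    mlflow.set_experiment(experiment)
    with mlflow.start_run() as run:
        mlflow.set_tags(pending_run_tags(judge_kind))
        # Artifact file names below are illustrative choices.
        mlflow.log_text(prompt, "prompt.txt")
        mlflow.log_text(context, "context.txt")
        mlflow.log_dict({"candidate_outputs": candidate_outputs},
                        "candidate_outputs.json")
        return run.info.run_id


tags = pending_run_tags("claude-code")
```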

Flow

1. Sim runner produces candidates → MLflow run created with judge_pending=true.
2. A slash command .claude/commands/judge-mlflow.md (or a small ml/experiments/judge_cli.py):
   - queries MLflow for runs where judge_pending=true AND judge_kind=claude-code
   - pulls the prompt + candidates + the fixed rubric file (ml/experiments/rubrics/tip-v1.md)
   - the active Claude Code session scores inline, writes metrics (relevance, actionability, tone, format_ok) via mlflow.log_metric, sets judge_pending=false, stamps judged_at and judged_by=claude-code-session.
3. The rubric file is the consistency anchor — same prompt/criteria across sessions, no drift.
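
The query and write-back steps of the flow could be sketched roughly as below. The verdict dict layout and the helper names are assumptions, and the write-back uses `MlflowClient.log_metric` / `set_tag` (which take a run ID) rather than the fluent `mlflow.log_metric`, purely so no run has to be active.

```python
from datetime import datetime, timezone

# MLflow search filter for runs waiting on the Claude Code judge.
PENDING_FILTER = "tags.judge_pending = 'true' AND tags.judge_kind = 'claude-code'"
SCALE_METRICS = ("relevance", "actionability", "tone")  # scored 1-5


def fetch_pending(experiment_name):
    """Return pending runs for one experiment as a pandas DataFrame."""
    import mlflow  # lazy import
    return mlflow.search_runs(experiment_names=[experiment_name],
                              filter_string=PENDING_FILTER)


def verdict_updates(verdict, now=None):
    """Validate a verdict and return the (metrics, tags) to write back."""
    for key in SCALE_METRICS:
        if not 1 <= verdict[key] <= 5:
            raise ValueError(f"{key} must be in 1..5, got {verdict[key]}")
    metrics = {key: float(verdict[key]) for key in SCALE_METRICS}
    metrics["format_ok"] = 1.0 if verdict["format_ok"] else 0.0
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    tags = {"judge_pending": "false", "judged_at": stamp,
            "judged_by": "claude-code-session"}
    return metrics, tags


def apply_verdict(run_id, verdict):
    """Write one verdict back to an existing MLflow run."""
    from mlflow.tracking import MlflowClient  # lazy import
    client = MlflowClient()
    metrics, tags = verdict_updates(verdict)
    for key, value in metrics.items():
        client.log_metric(run_id, key, value)
    for key, value in tags.items():
        client.set_tag(run_id, key, value)


metrics, tags = verdict_updates(
    {"relevance": 4, "actionability": 5, "tone": 3, "format_ok": True},
    now=datetime(2026, 4, 27, tzinfo=timezone.utc),
)
```

Flipping judge_pending to "false" in the same write-back is what keeps a judged run from showing up in the next export.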

Why this works

Cost is zero beyond the Claude Code session you're already paying for — no Opus/Haiku tokens.
Verdicts land in the same MLflow experiment as auto-judge runs, so dashboards and judge_kind slicing just work.
"Lazy" = nothing blocks; pending runs accumulate and get cleared when you sit down.

Tradeoff

Throughput bounded by sit-down time → fine for offline iteration (#93 model benchmark, #95 prompt A/B), not for gating promotion.
For gating: keep the cheap auto-judge (Haiku via LiteLLM) running ahead, and have Claude Code spot-check only the disagreements (|auto_score - human_score| > Δ).

Net effect: same MLflow surface, two judge tiers (auto + lazy human-in-loop), and promotion criteria can require both to agree.
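
A small sketch of the two-tier gate, under some assumptions: the disagreement check uses the absolute gap between the two scores, both tiers report a comparable composite score, and the threshold and Δ values are placeholders rather than agreed numbers.

```python
def needs_spot_check(auto_score, human_score, delta):
    """Flag a run when the two judge tiers disagree by more than delta."""
    return abs(auto_score - human_score) > delta


def can_promote(auto_score, human_score, threshold, delta):
    """Both tiers must clear the threshold and agree within delta."""
    if human_score is None:  # lazy judge has not scored this run yet
        return False
    agree = not needs_spot_check(auto_score, human_score, delta)
    return agree and auto_score >= threshold and human_score >= threshold


# Placeholder numbers for illustration only.
decision_agree = can_promote(13.0, 12.5, threshold=12.0, delta=1.0)
decision_disagree = can_promote(13.0, 10.0, threshold=12.0, delta=1.0)
decision_pending = can_promote(13.0, None, threshold=12.0, delta=1.0)
```

Treating a missing lazy-judge score as "do not promote" keeps the gate conservative while pending runs pile up.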

alvis referenced this issue

2026-04-27 05:20:11 +00:00

research: model benchmark for tip generation — qwen2.5 vs llama3.2 vs gemma3 #93

alvis closed this issue

2026-04-27 12:01:51 +00:00

alvis referenced this issue from a commit

2026-05-03 16:39:02 +00:00

feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)
