feat: automated prompt optimization loop — sim A/B → promote winner #95

Closed
opened 2026-04-17 08:13:44 +00:00 by alvis · 1 comment
Owner

Goal

Automate the prompt improvement cycle via offline sim A/B testing and MLflow.

COMPLETED

Implemented the lazy-judge pattern for prompt A/B evaluation, combined with model benchmarking in issue #93.

How it works

  1. collect.py generates candidates per (model × prompt × scenario) cell
  2. judge_cli.py --export pulls pending runs into JSON for Claude Code to score
  3. Rubric (tip-v1) anchors scoring: relevance/actionability/tone (1–5), format_ok, overlong
  4. Claude Code judge scores inline (zero API cost, human in the loop)
  5. judge_cli.py --apply writes metrics back to MLflow
  6. compare.py generates leaderboard by (model, prompt) cell
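
A rough sketch of the step 6 aggregation, for anyone reading without the repo open. This is not the actual compare.py: the run dictionaries and field names are hypothetical, and the composite is assumed to be the plain sum of relevance, actionability and tone (the issue does not define it).

```python
from collections import defaultdict
from statistics import mean

RUBRIC_DIMENSIONS = ("relevance", "actionability", "tone")  # each scored 1-5


def composite(scores):
    """Assumed composite: the plain sum of the three 1-5 dimensions (max 15)."""
    return sum(scores[dim] for dim in RUBRIC_DIMENSIONS)


def leaderboard(runs):
    """Average the composite per (model, prompt) cell, best cell first."""
    cells = defaultdict(list)
    for run in runs:
        cells[(run["model"], run["prompt"])].append(composite(run["scores"]))
    rows = [
        {"model": model, "prompt": prompt, "composite_avg": mean(vals), "n": len(vals)}
        for (model, prompt), vals in cells.items()
    ]
    return sorted(rows, key=lambda row: row["composite_avg"], reverse=True)
```

Each scenario contributes one score to its (model, prompt) cell, so the average is taken over scenarios.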

Results

  • Evaluated 3 prompts (v1, v2-mentor, v3-few-shot) across 4 models
  • Top prompt: v3-few-shot (composite 12.75 avg)
  • All scores stored in MLflow experiment tip-bench-2026-04-27
  • Rubric persisted in every run's tags for consistency

Airflow integration

  • DAG bench_collect orchestrates collect → export → compare
  • Configurable via dag_run.conf
  • Triggered on-demand from Airflow UI or admin API
  • The human-in-the-loop judge (Claude Code) is a manual step after export
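
One way the dag_run.conf overrides could be resolved inside a task, as a sketch only. The key names (models, prompts, scenarios) and the placeholder defaults are assumptions, not the DAG's real schema.

```python
# Hypothetical defaults; the real DAG's keys and values may differ.
DEFAULT_CONF = {
    "models": ["model-a"],
    "prompts": ["v1"],
    "scenarios": ["scenario-1"],
}


def resolve_conf(conf):
    """Overlay trigger-time overrides on the defaults, rejecting unknown keys."""
    overrides = conf or {}  # a trigger without config may pass None or {}
    unknown = set(overrides) - set(DEFAULT_CONF)
    if unknown:
        raise ValueError(f"unknown conf keys: {sorted(unknown)}")
    return {**DEFAULT_CONF, **overrides}
```

Rejecting unknown keys keeps a typo in the trigger payload from silently running the default matrix.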

Admin API

  • GET /api/bench/experiments, /api/bench/runs/:experiment, /api/bench/leaderboard/:experiment
  • POST /api/bench/run to trigger DAG with custom config
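
A minimal client sketch for the endpoints above, using only the standard library. The base URL and the {"conf": ...} body shape are assumptions; only the paths come from this issue.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical host and port


def leaderboard_url(experiment):
    """URL for GET /api/bench/leaderboard/:experiment, with the name escaped."""
    return f"{BASE_URL}/api/bench/leaderboard/{quote(experiment, safe='')}"


def build_run_request(conf):
    """Prepare (but do not send) POST /api/bench/run with a custom config."""
    body = json.dumps({"conf": conf}).encode("utf-8")  # assumed body shape
    return Request(
        f"{BASE_URL}/api/bench/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The prepared request can be sent with urllib.request.urlopen(req). Escaping matters for experiment names that contain a slash.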

See commits 556019b (bench harness) and 0474ad4 (Airflow integration) for full implementation.

alvis added this to the M4 — MLOps at scale milestone 2026-04-17 08:13:44 +00:00
Author
Owner

Idea: Claude Code as a lazy judge (no Opus API spend)

Instead of (or alongside) the Haiku auto-judge, organize MLflow runs so the current Claude Code session can play judge on demand:

Schema

  • One MLflow experiment per judge task (e.g. tip-quality-v1, prompt-ab/v1-vs-v2).
  • Each run = one model/prompt under test. Log:
    • prompt, context, candidate_outputs[] as artifacts
    • tag judge_pending=true, judge_kind=claude-code|haiku
    • rubric version as a tag (rubric=tip-v1)
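
A small sketch of the tag set for a freshly created run, assuming the tag names listed above. The helper name is hypothetical. Values are kept as strings because MLflow stores tag values as strings.

```python
JUDGE_KINDS = {"claude-code", "haiku"}


def pending_run_tags(judge_kind, rubric="tip-v1"):
    """Tags for a run that still needs a verdict."""
    if judge_kind not in JUDGE_KINDS:
        raise ValueError(f"judge_kind must be one of {sorted(JUDGE_KINDS)}")
    return {"judge_pending": "true", "judge_kind": judge_kind, "rubric": rubric}
```

The resulting dict could be passed to mlflow.set_tags inside an active run.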

Flow

  1. Sim runner produces candidates → MLflow run created with judge_pending=true.
  2. A slash command .claude/commands/judge-mlflow.md (or a small ml/experiments/judge_cli.py):
    • queries MLflow for runs where judge_pending=true AND judge_kind=claude-code
    • pulls the prompt + candidates + the fixed rubric file (ml/experiments/rubrics/tip-v1.md)
    • the active Claude Code session scores inline, writes metrics (relevance, actionability, tone, format_ok) via mlflow.log_metric, sets judge_pending=false, stamps judged_at and judged_by=claude-code-session.
  3. The rubric file is the consistency anchor — same prompt/criteria across sessions, no drift.
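
Step 2 could look roughly like this. The filter string uses MLflow's search syntax and could be passed as filter_string to mlflow.search_runs. The run is modeled as a plain dict so the sketch runs without a tracking server; apply_verdict and its validation rules are assumptions, not existing code.

```python
from datetime import datetime, timezone

# Selects runs still waiting for the Claude Code judge.
PENDING_FILTER = "tags.judge_pending = 'true' AND tags.judge_kind = 'claude-code'"

SCALE_METRICS = ("relevance", "actionability", "tone")  # scored 1-5


def apply_verdict(run, verdict, judged_by="claude-code-session"):
    """Validate a verdict, then record metrics and clear the pending flag."""
    for name in SCALE_METRICS:
        if not 1 <= verdict[name] <= 5:
            raise ValueError(f"{name} must be within 1-5, got {verdict[name]}")
    if verdict["format_ok"] not in (0, 1):
        raise ValueError("format_ok must be 0 or 1")
    # Nothing is mutated until the whole verdict has passed validation.
    for name in (*SCALE_METRICS, "format_ok"):
        run["metrics"][name] = float(verdict[name])
    run["tags"].update(
        judge_pending="false",
        judged_at=datetime.now(timezone.utc).isoformat(),
        judged_by=judged_by,
    )
    return run
```

With a real tracking server, the metric writes would go through mlflow.log_metric and the tag updates through mlflow.set_tag on the reopened run.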

Why this works

  • Cost is zero beyond the Claude Code session you're already paying for — no Opus/Haiku tokens.
  • Verdicts land in the same MLflow experiment as auto-judge runs, so dashboards and judge_kind slicing just work.
  • "Lazy" = nothing blocks; pending runs accumulate and get cleared when you sit down.

Tradeoff

  • Throughput bounded by sit-down time → fine for offline iteration (#93 model benchmark, #95 prompt A/B), not for gating promotion.
  • For gating: keep the cheap auto-judge (Haiku via LiteLLM) running ahead, and have Claude Code only spot-check disagreements (|auto_score - human_score| > Δ).
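
A sketch of the disagreement check, assuming the gap is meant as an absolute difference per rubric dimension. The function name and the default Δ of 1.0 are placeholders.

```python
def disagreements(auto_scores, human_scores, delta=1.0):
    """Dimensions where the two judges differ by more than delta, either way."""
    shared = auto_scores.keys() & human_scores.keys()
    return sorted(
        dim for dim in shared
        if abs(auto_scores[dim] - human_scores[dim]) > delta
    )
```

Only runs with a non-empty result would need a second look; a gap exactly equal to Δ does not count as a disagreement.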

Net effect: same MLflow surface, two judge tiers (auto + lazy human-in-loop), and promotion criteria that can require both to agree.

alvis closed this issue 2026-04-27 12:01:51 +00:00
Reference: alvis/oO#95