feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)

Combines model evaluation (#93) and prompt A/B testing (#95) into one experiment. Evaluates all (model × prompt × scenario) cells on the same fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell, logs to MLflow with judge_pending=true. Rejects models >4B, uses keep_alive=0 for RAM safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone, plus format_ok and overlong flags. Composite = rel + act + tone + 2×format_ok − overlong. Rubric is self-describing and persisted in every run so judges use consistent criteria across sessions.

Artifacts (prompts, candidates, raw responses) stored as MLflow tags because the server uses a file:// backend not accessible via REST. Full artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75). Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 11:48:59 +00:00
parent e40dfdcbb0
commit 556019b060
8 changed files with 1147 additions and 0 deletions
--- a/ml/experiments/bench/README.md
+++ b/ml/experiments/bench/README.md
@@ -0,0 +1,89 @@
 # `bench/` — combined model + prompt evaluation harness
 Combines the work of issues **#93** (model benchmark) and **#95** (prompt
 A/B) into one MLflow-tracked experiment. Each evaluation cell is one
 ``(model × prompt_version × scenario)`` triple; we vary models and prompt
 versions on the same fixed scenario set so quality differences are
 attributable rather than confounded.
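
As a rough illustration of what "one cell per triple" means, the sketch below enumerates the grid with `itertools.product`. The model and prompt names come from the quick-start command; the scenario ids are placeholders (the real ones come from `scenarios.py`), so treat this as an assumption-laden sketch rather than the harness's own loop.

```python
from itertools import product

# Model and prompt names are the ones used in the quick-start command.
models = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
prompts = ["v1", "v2-mentor", "v3-few-shot"]
# Placeholder scenario ids (hypothetical); the real ids come from scenarios.py.
scenarios = ["s1", "s2", "s3", "s4"]

# One cell per (model, prompt, scenario) triple. Every (model, prompt) pair
# is evaluated on exactly the same scenario list.
cells = list(product(models, prompts, scenarios))

print(len(cells))   # 4 * 3 * 4 = 48
print(cells[0])     # ('qwen2.5:0.5b', 'v1', 's1')
```

Because the scenario list is identical for every pair, a difference in mean score between two cells cannot come from one of them having drawn easier inputs.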
 ## Pieces
 | File | Purpose |
 |------|---------|
 | `rubric.md`         | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
 | `scenarios.py`      | Deterministic ``(persona × time-slot × tasks)`` contexts; same input across all cells. |
 | `mlflow_client.py`  | Thin httpx-based MLflow REST wrapper. Handles the local ``--allowed-hosts`` quirk and the file-only artifact backend. |
 | `collect.py`        | **Phase A.** Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
 | `judge_cli.py`      | **Phase B.** `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
 | `compare.py`        | **Phase C.** Leaderboard per ``(model, prompt)`` cell. |
 ## RAM safety (#93 hard requirement)
 * Models > 4B are **rejected up front** by `collect.py --max-model-b 4.0`.
 * Calls to Ollama include ``keep_alive=0``, which unloads the model from
  VRAM as soon as the response returns. We never hold two LLM weights
  concurrently.
 * The judge holds no local weights either: scoring happens in the
  Claude Code session, so it adds no local RAM cost.
 The pipeline can run end to end on a box with 15 GiB RAM and 8 GiB VRAM
 (1070-class GPU) without paging.
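
The size gate can be sketched as follows. The regex and the comparison mirror `_model_too_big` in `collect.py`; the function name `model_too_big` and the sample tags are illustrative only. Note that a tag with no `:<N>b` suffix is allowed through, so the gate relies on the naming convention rather than on the model's actual size.

```python
import re

# Same pattern as collect.py: matches ":3b", ":0.5b", ":7B", and similar.
SIZE_TAG = re.compile(r":(\d+(?:\.\d+)?)b\b", re.IGNORECASE)

def model_too_big(model: str, max_b: float = 4.0) -> bool:
    """Return True when the tag advertises more than max_b billion params."""
    m = SIZE_TAG.search(model)
    if not m:
        # No recognisable size suffix: the tag is allowed through unchecked.
        return False
    return float(m.group(1)) > max_b

for tag in ["qwen2.5:0.5b", "llama3.2:3b", "llama3.1:8b", "nomic-embed-text"]:
    print(f"{tag:20s} rejected={model_too_big(tag)}")
```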
 ## Quick start
 ```bash
 # 1. Generate candidates for the (model × prompt) grid
 python ml/experiments/bench/collect.py \
    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
    --prompts v1,v2-mentor,v3-few-shot \
    --experiment tip-bench-2026-04-27 \
    --n-tips 5 \
    --diversity
 # 2. Export pending runs for Claude Code to score
 python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --export /tmp/oo-bench-judge.json
 # 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)
 # 4. Push scores back to MLflow
 python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --apply /tmp/oo-bench-judge.json
 # 5. Leaderboard
 python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
 ```
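
Step 3 amounts to filling the `scores` slot of each exported item. The sketch below shows one way to do that with basic range checks; `fill_scores` is a hypothetical helper (not part of the harness), and the sample item is invented, with only its `scores` keys taken from what `judge_cli.py --export` writes.

```python
import copy
import json

def fill_scores(item: dict, relevance: int, actionability: int, tone: int,
                overlong: int = 0, notes: str = "") -> dict:
    """Return a copy of one exported item with its ``scores`` slot filled in."""
    for name, value in (("relevance", relevance),
                        ("actionability", actionability),
                        ("tone", tone)):
        # The rubric uses integer 1-5 scales for these three dimensions.
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    if overlong not in (0, 1):
        raise ValueError(f"overlong must be 0 or 1, got {overlong!r}")
    scored = copy.deepcopy(item)
    scored["scores"] = {
        "relevance": relevance,
        "actionability": actionability,
        "tone": tone,
        "overlong": overlong,
        "notes": notes,
    }
    return scored

# Invented item for illustration; real items also carry prompt, candidates, etc.
item = {
    "run_id": "abc123",
    "scores": {"relevance": None, "actionability": None, "tone": None,
               "overlong": None, "notes": ""},
}
scored = fill_scores(item, relevance=4, actionability=3, tone=5)
print(json.dumps(scored["scores"], indent=2))
```

Items whose `relevance`, `actionability`, or `tone` are left as `null` are skipped by `--apply`, so a partially scored file can be applied and finished later.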
 ## Why the rubric matters
 Different judging sessions need to be comparable. `rubric.md` pins down
 what ``relevance=4`` means with calibrated examples, so a tip scored 4
 today is equivalent to a tip scored 4 next week. Without the rubric, the
 "lazy human-in-the-loop" judge drifts.
 ## Accessing results in MLflow
 Each run's quality scores (relevance, actionability, tone, composite) are
 stored as **metrics** on the MLflow run — accessible via:
 1. **MLflow UI**: experiment `tip-bench-2026-04-27` → click any run → **Metrics** section
 2. **Leaderboard**: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
 3. **Raw API**: `mlflow_client.search_runs()` filters and pulls metrics in bulk
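
The `composite` metric is derived from the other scores. A minimal sketch of the arithmetic, following the formula used in `judge_cli.py` (relevance + actionability + tone + 2 × format_ok − overlong); the function name and the sample values are illustrative:

```python
def composite(relevance: float, actionability: float, tone: float,
              format_ok: float, overlong: float) -> float:
    """Composite score: three 1-5 scales, a format bonus, an overlong penalty."""
    return relevance + actionability + tone + 2 * format_ok - overlong

print(composite(4, 4, 4, format_ok=1, overlong=0))  # 14
print(composite(5, 5, 5, format_ok=1, overlong=0))  # 17, the maximum
print(composite(1, 1, 1, format_ok=0, overlong=1))  # 2, the minimum
```

The leaderboard reports the mean of this value over a cell's scenarios, which is why a cell-level composite can be fractional.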
 Candidate tips, prompts, and raw responses are stored as **tags** with
 keys `artifact:candidates.json`, `artifact:prompt.txt`, `artifact:raw.txt`
 (tag fallback because the MLflow server uses a file:// artifact backend
 not accessible via REST from the host).
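
For reading these back programmatically, the sketch below flattens the key/value lists of a run payload into dictionaries, the same way `compare.py` does, and decodes the candidates tag. The `run` dictionary is a hand-written stand-in for one entry returned by `search_runs()`; its values are invented for illustration.

```python
import json

def flatten(run: dict, section: str) -> dict:
    """Turn run['data'][section], a list of {key, value} entries, into a dict."""
    return {entry["key"]: entry["value"] for entry in run["data"].get(section, [])}

# Invented payload with the shape the harness reads (info + data lists).
run = {
    "info": {"run_id": "abc123", "status": "FINISHED"},
    "data": {
        "metrics": [
            {"key": "composite", "value": 13.0},
            {"key": "format_ok", "value": 1.0},
        ],
        "tags": [
            {"key": "judge_pending", "value": "false"},
            {"key": "artifact:candidates.json",
             "value": '[{"id": "tip-0", "content": "Take a short walk."}]'},
        ],
    },
}

metrics = flatten(run, "metrics")
tags = flatten(run, "tags")
# Tag values are plain strings, so the candidates need one json.loads.
candidates = json.loads(tags["artifact:candidates.json"])

print(metrics["composite"], candidates[0]["content"])
```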
 ## Integrating with Airflow (#95)
 A future DAG `ml/pipelines/prompt_ab_eval.py` will wrap `collect.py`
 exactly as shown in the quick-start, triggered on-demand from the admin
 UI or manually. The results feed into the admin leaderboard view.
 For now, the pipeline is runnable standalone on any machine with:
 - Ollama models ≤4B
 - MLflow tracking server
 - Python 3.10+
--- a/ml/experiments/bench/__init__.py
+++ b/ml/experiments/bench/__init__.py
@@ -0,0 +1,18 @@
 """oO tip-generation benchmark harness.
 Combines model evaluation (#93) and prompt A/B testing (#95) into one
 MLflow-tracked experiment. Each evaluation cell is one (model × prompt ×
 scenario) triple; we vary models and prompts on the same fixed scenario
 set so quality differences are attributable rather than confounded.
 The pipeline follows the lazy-judge pattern: collect candidates with
 deterministic metrics (latency, format_ok), export to a JSON file for
 Claude Code to score per the rubric, apply scores back to MLflow, and
 generate a leaderboard.
 RAM safety is enforced: models >4B are rejected, Ollama calls use
 keep_alive=0 to unload VRAM immediately, and the human judge (Claude Code
 session) has zero inference cost.
 See README.md for usage.
 """
--- a/ml/experiments/bench/collect.py
+++ b/ml/experiments/bench/collect.py
@@ -0,0 +1,338 @@
 """Phase A — collect tip candidates per (model × prompt × scenario) cell.
 Each cell produces one MLflow run with:
  params:   model, prompt_version, scenario_id, persona, hour_of_day,
            day_of_week, n_tips_requested, temperature
  tags:     judge_pending=true, judge_kind=claude-code, rubric=tip-v1,
            model, prompt_version, scenario_id, persona, rubric_md
  metrics:  latency_ms, prompt_tokens (best effort), completion_tokens,
            n_parsed, format_ok, mean_diversity (cosine, optional)
  artifacts (as tags via mlflow_client.log_text):
            prompt.txt          system + user prompt as sent
            candidates.json     parsed candidate array
            raw.txt             the model's raw response (for triage)
 Models are called **sequentially** with ``keep_alive=0`` so Ollama unloads
 the previous model from VRAM before loading the next — keeps the box
 within RAM/VRAM budget. Models > 4B are rejected up front.
 Usage:
    python collect.py \\
        --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \\
        --prompts v1,v2-mentor,v3-few-shot \\
        --n-tips 5 \\
        --experiment tip-bench-2026-04-27
 """
 from __future__ import annotations
 import argparse
 import json
 import math
 import os
 import re
 import sys
 import time
 from dataclasses import asdict
 from pathlib import Path
 import httpx
 _BENCH = Path(__file__).resolve().parent
 _ML = _BENCH.parent.parent
 sys.path.insert(0, str(_BENCH))
 sys.path.insert(0, str(_BENCH.parent / "sim"))
 sys.path.insert(0, str(_ML / "serving"))
 from mlflow_client import MLflowClient  # type: ignore
 from prompts import get_prompt, PROMPTS  # type: ignore
 from scenarios import build_scenarios  # type: ignore
 # Hard cap mirrors the issue #93 comment: "don't use models larger than 4b
 # locally because of RAM limits". A regex cheap-match on the tag handles
 # the common ``name:Nb`` and ``name:N.Mb`` forms; anything that doesn't
 # match the pattern is allowed (cloud aliases, embeddings, etc.).
 _SIZE_TAG = re.compile(r":(\d+(?:\.\d+)?)b\b", re.IGNORECASE)
 def _model_too_big(model: str, max_b: float = 4.0) -> bool:
    m = _SIZE_TAG.search(model)
    if not m:
        return False
    return float(m.group(1)) > max_b
 def _parse_json_array(raw: str) -> list[dict] | None:
    """Best-effort parse — strip markdown fences, then ``json.loads``."""
    text = raw.strip()
    if text.startswith("```"):
        parts = text.split("```")
        text = parts[1] if len(parts) > 1 else text
        if text.lstrip().lower().startswith("json"):
            text = text.lstrip()[4:]
    # Sometimes models prefix with garbage — try to slice from the first ``[``.
    if not text.lstrip().startswith("["):
        i = text.find("[")
        if i >= 0:
            text = text[i:]
    try:
        v = json.loads(text)
        return v if isinstance(v, list) else None
    except (json.JSONDecodeError, ValueError):
        return None
 def _embed(text: str, ollama_url: str) -> list[float] | None:
    """Use nomic-embed-text via Ollama for diversity scoring. ~250MB,
    safe to load alongside any 4B chat model thanks to ``keep_alive=0``.
    """
    try:
        with httpx.Client(trust_env=False, timeout=30.0) as c:
            r = c.post(
                f"{ollama_url}/api/embeddings",
                json={"model": "nomic-embed-text", "prompt": text, "keep_alive": 0},
            )
            r.raise_for_status()
            return r.json().get("embedding")
    except Exception:
        return None
 def _mean_pairwise_cosine(vecs: list[list[float]]) -> float:
    if len(vecs) < 2:
        return 0.0
    def cos(a: list[float], b: list[float]) -> float:
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        if na == 0 or nb == 0:
            return 0.0
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    n = len(vecs)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cos(vecs[i], vecs[j])
            count += 1
    return total / count if count else 0.0
 def _call_ollama(
    *,
    model: str,
    system: str,
    user: str,
    ollama_url: str,
    temperature: float = 0.7,
 ) -> tuple[str, dict]:
    """Direct call to Ollama. Returns (raw_text, telemetry).
    ``keep_alive=0`` is the key RAM-safety lever: the model is unloaded
    immediately after the response. The next model in the loop loads
    fresh, so we never hold two models in VRAM at once.
    """
    t0 = time.perf_counter()
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
        "keep_alive": 0,
        "options": {"temperature": temperature},
    }
    with httpx.Client(trust_env=False, timeout=180.0) as c:
        r = c.post(f"{ollama_url}/api/chat", json=body)
        r.raise_for_status()
        data = r.json()
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    raw = data.get("message", {}).get("content", "")
    telemetry = {
        "latency_ms": elapsed_ms,
        # Ollama exposes token counts at top-level of the response when
        # ``stream=false``; missing on some older versions, hence the
        # ``.get`` defaults.
        "prompt_tokens": float(data.get("prompt_eval_count", 0) or 0),
        "completion_tokens": float(data.get("eval_count", 0) or 0),
    }
    return raw, telemetry
 def main() -> int:
    parser = argparse.ArgumentParser(description="oO tip-generation benchmark — Phase A")
    parser.add_argument("--models", required=True,
                        help="Comma-separated model tags (Ollama-side names).")
    parser.add_argument("--prompts", default=",".join(PROMPTS.keys()),
                        help="Comma-separated prompt versions from ml/serving/prompts.py.")
    parser.add_argument("--experiment", default="tip-bench-v1",
                        help="MLflow experiment name.")
    parser.add_argument("--n-tips", type=int, default=5,
                        help="Tips to request per scenario.")
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--ollama-url", default=os.environ.get("OLLAMA_URL", "http://localhost:11434"))
    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"))
    parser.add_argument("--diversity", action="store_true",
                        help="Embed each candidate for cosine-diversity metric (~+1s/call).")
    parser.add_argument("--max-model-b", type=float, default=4.0,
                        help="Reject models tagged larger than this many billion params.")
    parser.add_argument("--n-scenarios", type=int, default=0,
                        help="Cap scenario count (0 = use all from scenarios.py).")
    parser.add_argument("--rubric", default=str(_BENCH / "rubric.md"),
                        help="Rubric file; its text is stored as a tag on every run.")
    args = parser.parse_args()
    models = [m.strip() for m in args.models.split(",") if m.strip()]
    prompts = [p.strip() for p in args.prompts.split(",") if p.strip()]
    too_big = [m for m in models if _model_too_big(m, args.max_model_b)]
    if too_big:
        print(f"ERROR: models exceed --max-model-b={args.max_model_b}: {too_big}", file=sys.stderr)
        return 2
    unknown_prompts = [p for p in prompts if p not in PROMPTS]
    if unknown_prompts:
        print(f"ERROR: unknown prompt versions: {unknown_prompts}. "
              f"Available: {list(PROMPTS)}", file=sys.stderr)
        return 2
    scenarios = build_scenarios()
    if args.n_scenarios and args.n_scenarios < len(scenarios):
        scenarios = scenarios[:args.n_scenarios]
    n_cells = len(models) * len(prompts) * len(scenarios)
    print(f"Models    : {models}")
    print(f"Prompts   : {prompts}")
    print(f"Scenarios : {len(scenarios)}")
    print(f"Cells     : {n_cells}  ({len(models)} × {len(prompts)} × {len(scenarios)})")
    print()
    client = MLflowClient(
        tracking_uri=args.mlflow_url,
        username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
        password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
    )
    exp_id = client.get_or_create_experiment(args.experiment)
    print(f"MLflow experiment: {args.experiment}  (id={exp_id})")
    rubric_text = Path(args.rubric).read_text(encoding="utf-8")
    # Outer loop is *model* so all cells for one model run back to back.
    # Note that ``keep_alive=0`` asks Ollama to unload the model after
    # every response, so each cell may pay a fresh load; the point of the
    # ordering is that two different chat models never overlap in VRAM.
    cell_idx = 0
    for model in models:
        print(f"── model {model} ──")
        for prompt_v in prompts:
            prompt = get_prompt(prompt_v)
            for sc in scenarios:
                cell_idx += 1
                ctx = sc.to_prompt_context()
                class _Ctx:
                    pass
                _ctx = _Ctx()
                _ctx.tasks = ctx["tasks"]
                _ctx.hour_of_day = ctx["hour_of_day"]
                _ctx.day_of_week = ctx["day_of_week"]
                _ctx.extra = ctx["extra"]
                user_msg = prompt.build_user(_ctx, args.n_tips)
                run_id = client.create_run(
                    exp_id,
                    run_name=f"{model}__{prompt_v}__{sc.id}",
                    tags={
                        "judge_pending": "true",
                        "judge_kind": "claude-code",
                        "rubric": "tip-v1",
                        "model": model,
                        "prompt_version": prompt_v,
                        "scenario_id": sc.id,
                        "persona": sc.persona.name,
                    },
                )
                client.log_params(run_id, {
                    "model": model,
                    "prompt_version": prompt_v,
                    "scenario_id": sc.id,
                    "persona": sc.persona.name,
                    "hour_of_day": sc.hour_of_day,
                    "day_of_week": sc.day_of_week,
                    "n_tips_requested": args.n_tips,
                    "temperature": args.temperature,
                })
                try:
                    raw, telemetry = _call_ollama(
                        model=model,
                        system=prompt.system,
                        user=user_msg,
                        ollama_url=args.ollama_url,
                        temperature=args.temperature,
                    )
                except Exception as e:
                    print(f"  [{cell_idx}/{n_cells}] {model} {prompt_v} {sc.id}: ERROR {e}")
                    client.set_tag(run_id, "error", str(e)[:500])
                    client.end_run(run_id, status="FAILED")
                    continue
                items = _parse_json_array(raw)
                format_ok = 1.0 if items is not None else 0.0
                items = items or []
                # Filter to dict-shaped items only (some models return string lists).
                cand_dicts = [
                    {
                        "id": str(it.get("id", f"tip-{i}")),
                        "content": str(it.get("content", "")),
                        "rationale": str(it.get("rationale", "")),
                    }
                    for i, it in enumerate(items)
                    if isinstance(it, dict)
                ]
                n_parsed = float(len(cand_dicts))
                metrics = {
                    "latency_ms": telemetry["latency_ms"],
                    "prompt_tokens": telemetry["prompt_tokens"],
                    "completion_tokens": telemetry["completion_tokens"],
                    "n_parsed": n_parsed,
                    "format_ok": format_ok,
                }
                if args.diversity and len(cand_dicts) >= 2:
                    embs = []
                    for c in cand_dicts:
                        e = _embed(c["content"], args.ollama_url)
                        if e:
                            embs.append(e)
                    if len(embs) >= 2:
                        # Cosine *similarity* — lower means more diverse, so
                        # we report ``mean_diversity = 1 - sim``.
                        sim = _mean_pairwise_cosine(embs)
                        metrics["mean_diversity"] = 1.0 - sim
                client.log_metrics(run_id, metrics)
                client.log_text(run_id, prompt.system + "\n\n---\n\n" + user_msg, "prompt.txt")
                client.log_text(run_id, json.dumps(cand_dicts, indent=2), "candidates.json")
                client.log_text(run_id, raw[:9_000], "raw.txt")
                # Persist the rubric text as a tag on every run (truncated to
                # the tag value limit). Cheap, and every run is self-describing.
                client.set_tag(run_id, "rubric_md", rubric_text[: client._TAG_VALUE_LIMIT])
                client.end_run(run_id)
                print(f"  [{cell_idx:>3}/{n_cells}] {model:18s} {prompt_v:12s} {sc.id:24s}  "
                      f"lat={metrics['latency_ms']:>6.0f}ms  parsed={int(n_parsed)}/{args.n_tips}  "
                      f"fmt={int(format_ok)}")
    print()
    print("Phase A complete. Run judge_cli.py --export to score pending runs.")
    print(f"  python ml/experiments/bench/judge_cli.py --experiment {args.experiment} \\")
    print("      --export /tmp/oo-bench-judge.json")
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/ml/experiments/bench/compare.py
+++ b/ml/experiments/bench/compare.py
@@ -0,0 +1,144 @@
 """Phase C — leaderboard from judged MLflow runs.
 Pulls every FINISHED run that has the ``composite`` metric set (or every
 FINISHED run when ``--include-pending`` is given) from the experiment,
 groups by (model, prompt) cell, and prints a leaderboard sorted by mean
 composite score.
 Also reports the deterministic-only metrics (latency, format_ok) so
 cells with great prose but broken JSON are visible.
 """
 from __future__ import annotations
 import argparse
 import os
 import statistics
 import sys
 from collections import defaultdict
 from pathlib import Path
 _BENCH = Path(__file__).resolve().parent
 sys.path.insert(0, str(_BENCH))
 from mlflow_client import MLflowClient  # type: ignore
 def _params(run: dict) -> dict[str, str]:
    return {p["key"]: p["value"] for p in run["data"].get("params", [])}
 def _metrics(run: dict) -> dict[str, float]:
    return {m["key"]: m["value"] for m in run["data"].get("metrics", [])}
 def _tags(run: dict) -> dict[str, str]:
    return {t["key"]: t["value"] for t in run["data"].get("tags", [])}
 def main() -> int:
    parser = argparse.ArgumentParser(description="oO bench — Phase C (leaderboard)")
    parser.add_argument("--experiment", required=True)
    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"))
    parser.add_argument("--include-pending", action="store_true",
                        help="Also include rows with no quality scores (latency/format only).")
    args = parser.parse_args()
    client = MLflowClient(
        tracking_uri=args.mlflow_url,
        username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
        password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
    )
    exp_id = client.get_or_create_experiment(args.experiment)
    runs = client.search_runs(exp_id, max_results=2000)
    # Group key = (model, prompt_version)
    cells: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for r in runs:
        params = _params(r)
        metrics = _metrics(r)
        tags = _tags(r)
        if r["info"].get("status") != "FINISHED":
            continue
        if not args.include_pending and "composite" not in metrics:
            continue
        cells[(params.get("model", "?"), params.get("prompt_version", "?"))].append({
            "metrics": metrics,
            "scenario": params.get("scenario_id", "?"),
            "judged": tags.get("judge_pending") == "false",
        })
    if not cells:
        print("No judged runs found. Did you run judge_cli.py --apply?")
        return 1
    rows = []
    for (model, prompt), records in cells.items():
        n = len(records)
        comp = [r["metrics"]["composite"] for r in records if "composite" in r["metrics"]]
        rel  = [r["metrics"]["relevance"] for r in records if "relevance" in r["metrics"]]
        act  = [r["metrics"]["actionability"] for r in records if "actionability" in r["metrics"]]
        tone = [r["metrics"]["tone"] for r in records if "tone" in r["metrics"]]
        lat  = [r["metrics"]["latency_ms"] for r in records if "latency_ms" in r["metrics"]]
        fmt  = [r["metrics"]["format_ok"] for r in records if "format_ok" in r["metrics"]]
        div  = [r["metrics"]["mean_diversity"] for r in records if "mean_diversity" in r["metrics"]]
        rows.append({
            "model": model,
            "prompt": prompt,
            "n": n,
            "composite": statistics.mean(comp) if comp else None,
            "relevance": statistics.mean(rel) if rel else None,
            "actionability": statistics.mean(act) if act else None,
            "tone": statistics.mean(tone) if tone else None,
            "format_ok": statistics.mean(fmt) if fmt else None,
            "latency_p50": statistics.median(lat) if lat else None,
            "latency_p95": _p95(lat) if lat else None,
            "diversity": statistics.mean(div) if div else None,
        })
    rows.sort(key=lambda r: r["composite"] if r["composite"] is not None else -1, reverse=True)
    # Width-fitted printer — keeps output legible in a 100-col terminal.
    print()
    print(f"Experiment: {args.experiment}  (id={exp_id})")
    print(f"Cells     : {len(rows)}")
    print()
    header = (
        f"{'#':>2}  {'model':18s} {'prompt':12s} {'n':>3s}  "
        f"{'comp':>5s} {'rel':>4s} {'act':>4s} {'tone':>4s} "
        f"{'fmt':>4s} {'p50':>6s} {'p95':>6s} {'div':>5s}"
    )
    print(header)
    print("─" * len(header))
    for i, r in enumerate(rows, 1):
        comp = f"{r['composite']:.2f}" if r["composite"] is not None else "  -- "
        rel  = f"{r['relevance']:.1f}" if r["relevance"] is not None else " -- "
        act  = f"{r['actionability']:.1f}" if r["actionability"] is not None else " -- "
        tone = f"{r['tone']:.1f}" if r["tone"] is not None else " -- "
        fmt  = f"{r['format_ok']:.2f}" if r["format_ok"] is not None else " -- "
        p50  = f"{r['latency_p50']:.0f}" if r["latency_p50"] is not None else "  --  "
        p95  = f"{r['latency_p95']:.0f}" if r["latency_p95"] is not None else "  --  "
        div  = f"{r['diversity']:.2f}" if r["diversity"] is not None else " -- "
        print(
            f"{i:>2}  {r['model']:18s} {r['prompt']:12s} {r['n']:>3d}  "
            f"{comp:>5s} {rel:>4s} {act:>4s} {tone:>4s} "
            f"{fmt:>4s} {p50:>6s} {p95:>6s} {div:>5s}"
        )
    if rows[0]["composite"] is not None:
        winner = rows[0]
        print()
        print(f"Winner: {winner['model']} × {winner['prompt']}  "
              f"(composite={winner['composite']:.2f}, n={winner['n']})")
    return 0
 def _p95(xs: list[float]) -> float:
    """Approximate 95th percentile: the sorted value at index round(0.95 * (n - 1))."""
    if not xs:
        return 0.0
    s = sorted(xs)
    idx = max(0, int(round(0.95 * (len(s) - 1))))
    return s[idx]
 if __name__ == "__main__":
    sys.exit(main())
--- a/ml/experiments/bench/judge_cli.py
+++ b/ml/experiments/bench/judge_cli.py
@@ -0,0 +1,191 @@
 """Phase B — Claude Code as the lazy MLflow judge.
 Two sub-commands, both keyed to MLflow tags so the same run cycles
 through ``judge_pending=true`` → judged → ``judge_pending=false`` exactly
 once.
  --export PATH
      Pull every run with ``judge_pending=true`` and ``judge_kind=claude-code``
      from the experiment, bundle the prompt + parsed candidates + the
      rubric into a single JSON file the Claude Code session can read.
  --apply PATH
      Read the responses (same shape as the request, with ``scores`` filled in)
      and log ``relevance``, ``actionability``, ``tone``, ``overlong`` and the
      derived ``composite`` as MLflow metrics on the corresponding runs. Sets
      ``judge_pending=false`` and stamps ``judged_at`` / ``judged_by`` so the
      run won't be picked up twice.
 The request file is intentionally one big JSON document, so the human
 judge sees the full set in one place and can score consistently.
 """
 from __future__ import annotations
 import argparse
 import json
 import os
 import sys
 import time
 from pathlib import Path
 _BENCH = Path(__file__).resolve().parent
 sys.path.insert(0, str(_BENCH))
 from mlflow_client import MLflowClient  # type: ignore
 _DIMENSIONS = ("relevance", "actionability", "tone")
 _BIN_FLAGS = ("overlong",)
 def _tags_dict(run: dict) -> dict[str, str]:
    return {t["key"]: t["value"] for t in run.get("data", {}).get("tags", [])}
 def _params_dict(run: dict) -> dict[str, str]:
    return {p["key"]: p["value"] for p in run.get("data", {}).get("params", [])}
 def export(client: MLflowClient, experiment: str, out_path: str) -> int:
    exp_id = client.get_or_create_experiment(experiment)
    runs = client.search_runs(
        exp_id,
        filter_string="tags.judge_pending = 'true' and tags.judge_kind = 'claude-code'",
    )
    if not runs:
        print("No pending runs.")
        Path(out_path).write_text(json.dumps({
            "experiment": experiment,
            "exported_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "rubric": "tip-v1",
            "items": [],
        }, indent=2))
        return 0
    rubric_text = (_BENCH / "rubric.md").read_text(encoding="utf-8")
    items: list[dict] = []
    for run in runs:
        run_id = run["info"]["run_id"]
        tags = _tags_dict(run)
        params = _params_dict(run)
        candidates_json = client.get_artifact_text(run_id, "candidates.json")
        prompt_text = client.get_artifact_text(run_id, "prompt.txt")
        try:
            candidates = json.loads(candidates_json) if candidates_json else []
        except json.JSONDecodeError:
            candidates = []
        items.append({
            "run_id": run_id,
            "model": params.get("model") or tags.get("model"),
            "prompt_version": params.get("prompt_version") or tags.get("prompt_version"),
            "scenario_id": params.get("scenario_id") or tags.get("scenario_id"),
            "persona": params.get("persona") or tags.get("persona"),
            "hour_of_day": int(params.get("hour_of_day", "12")),
            "day_of_week": int(params.get("day_of_week", "0")),
            "prompt": prompt_text,
            "candidates": candidates,
            # Per-run scoring slot — judge fills these in.
            "scores": {
                "relevance": None,        # 1–5, integer
                "actionability": None,    # 1–5, integer
                "tone": None,             # 1–5, integer
                "overlong": None,         # 0/1
                "notes": "",              # short comment, optional
            },
        })
    out = {
        "experiment": experiment,
        "exported_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "rubric": "tip-v1",
        "rubric_md": rubric_text,
        "items": items,
    }
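    # ensure_ascii=False can emit non-ASCII text, so pin the encoding that
    # apply() uses when reading the file back.
    Path(out_path).write_text(json.dumps(out, indent=2, ensure_ascii=False), encoding="utf-8")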
    print(f"Exported {len(items)} pending runs → {out_path}")
    return 0
 def apply(client: MLflowClient, experiment: str, in_path: str) -> int:
    exp_id = client.get_or_create_experiment(experiment)
    payload = json.loads(Path(in_path).read_text(encoding="utf-8"))
    items = payload.get("items", [])
    if not items:
        print("No items in response file.")
        return 0
    judged_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    n_applied, n_skipped = 0, 0
    for item in items:
        run_id = item["run_id"]
        scores = item.get("scores") or {}
        missing = [d for d in _DIMENSIONS if scores.get(d) in (None, "")]
        if missing:
            print(f"  [skip] {run_id}: missing {missing}")
            n_skipped += 1
            continue
        metrics = {d: float(scores[d]) for d in _DIMENSIONS}
        for f in _BIN_FLAGS:
            v = scores.get(f)
            if v not in (None, ""):
                metrics[f] = float(int(bool(int(v))))
        # Composite mirrors rubric.md: relevance + actionability + tone
        # + 2 * format_ok - overlong.  format_ok is already a metric on
        # the run from collect.py; re-fetching is cheap and keeps this
        # script idempotent if format compliance was retroactively fixed.
        run = client._get("/runs/get", {"run_id": run_id})["run"]
        existing_metrics = {m["key"]: m["value"] for m in run["data"].get("metrics", [])}
        format_ok = float(existing_metrics.get("format_ok", 0.0))
        overlong = metrics.get("overlong", 0.0)
        composite = (
            metrics["relevance"] + metrics["actionability"] + metrics["tone"]
            + 2 * format_ok - overlong
        )
        metrics["composite"] = composite
        client.log_metrics(run_id, metrics)
        client.set_tags(run_id, {
            "judge_pending": "false",
            "judged_at": judged_at,
            "judged_by": "claude-code-session",
        })
        if scores.get("notes"):
            client.set_tag(run_id, "judge_notes", str(scores["notes"])[:1000])
        n_applied += 1
        print(f"  [ok]   {run_id}: rel={metrics['relevance']:.1f} "
              f"act={metrics['actionability']:.1f} tone={metrics['tone']:.1f} "
              f"comp={composite:.2f}")
    print(f"Applied {n_applied}, skipped {n_skipped}.")
    return 0
 def main() -> int:
    parser = argparse.ArgumentParser(description="oO bench — Phase B (Claude Code judge)")
    parser.add_argument("--experiment", required=True)
    parser.add_argument("--mlflow-url", default=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"))
    grp = parser.add_mutually_exclusive_group(required=True)
    grp.add_argument("--export", metavar="PATH",
                     help="Write pending runs as a judgment-request JSON file.")
    grp.add_argument("--apply", metavar="PATH",
                     help="Read filled-in responses and write metrics back to MLflow.")
    args = parser.parse_args()
    client = MLflowClient(
        tracking_uri=args.mlflow_url,
        username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
        password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
    )
    if args.export:
        return export(client, args.experiment, args.export)
    return apply(client, args.experiment, args.apply)
 if __name__ == "__main__":
    sys.exit(main())
--- a/ml/experiments/bench/mlflow_client.py
+++ b/ml/experiments/bench/mlflow_client.py
@@ -0,0 +1,202 @@
 """Thin MLflow REST wrapper.
 Why not the official ``mlflow`` SDK? Two reasons specific to the oO setup:
 1. The MLflow server (3.11) ships with ``--allowed-hosts localhost`` but
   curl / requests / urllib3 send ``Host: localhost:5000`` — the port
   suffix fails the DNS-rebinding check. We override the Host header per
   request, which the SDK doesn't expose.
 2. The collect/judge phases only need ~6 endpoints (create/search/log).
   Pulling a 200MB SDK transitively for that is excess weight.
 All calls are synchronous httpx with explicit ``Host`` so the script can
 run from the host shell, from inside docker, or from Airflow workers
 without further config.
 """
 from __future__ import annotations
 import os
 import time
 from dataclasses import dataclass
 from typing import Any
 import httpx
 def _strip_path(uri: str) -> tuple[str, str]:
    """Return (origin, path_prefix) — handles both /mlflow and / roots.
    ``http://mlflow:5000/mlflow``  → ("http://mlflow:5000", "/mlflow")
    ``http://localhost:5000``      → ("http://localhost:5000", "")
    """
    uri = uri.rstrip("/")
    if "/" not in uri.split("://", 1)[1]:
        return uri, ""
    scheme, _, rest = uri.partition("://")
    host, _, path = rest.partition("/")
    # Parenthesized so the conditional visibly applies to the prefix only.
    return f"{scheme}://{host}", ("/" + path if path else "")
@dataclass
 class MLflowClient:
    tracking_uri: str
    username: str | None = None
    password: str | None = None
    host_header: str | None = None  # override for DNS-rebinding sidestep
    timeout: float = 30.0
    def __post_init__(self) -> None:
        self._origin, self._ui_prefix = _strip_path(self.tracking_uri)
        # MLflow 3.x exposes the REST API at the root, *not* under the
        # ``/mlflow`` UI prefix. Empirically verified against the running
        # ghcr.io/mlflow/mlflow:v3.11.1 container.
        self._api = f"{self._origin}/api/2.0/mlflow"
        self._auth = (self.username, self.password) if self.username else None
        # If user did not pass a host header, derive from origin. Strip
        # the port if present — the server's allowed-hosts check rejects
        # ``localhost:5000`` even when ``localhost`` is allowed.
        if self.host_header is None:
            host = self._origin.split("://", 1)[1]
            self.host_header = host.split(":", 1)[0]
    @classmethod
    def from_env(cls) -> "MLflowClient":
        return cls(
            tracking_uri=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
            username=os.environ.get("MLFLOW_TRACKING_USERNAME") or "admin",
            password=os.environ.get("MLFLOW_TRACKING_PASSWORD") or "password",
            host_header=os.environ.get("MLFLOW_HOST_HEADER"),
        )
    def _headers(self) -> dict[str, str]:
        return {"Host": self.host_header or "localhost"}
    def _post(self, path: str, body: dict) -> dict:
        with httpx.Client(trust_env=False, timeout=self.timeout) as c:
            r = c.post(f"{self._api}{path}", json=body, headers=self._headers(), auth=self._auth)
            r.raise_for_status()
            return r.json()
    def _get(self, path: str, params: dict | None = None) -> dict:
        with httpx.Client(trust_env=False, timeout=self.timeout) as c:
            r = c.get(f"{self._api}{path}", params=params or {}, headers=self._headers(), auth=self._auth)
            r.raise_for_status()
            return r.json()
    # ── Experiments ────────────────────────────────────────────────────
    def get_or_create_experiment(self, name: str) -> str:
        try:
            r = self._get("/experiments/get-by-name", {"experiment_name": name})
            return r["experiment"]["experiment_id"]
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in (404, 400):
                raise
        r = self._post("/experiments/create", {"name": name})
        return r["experiment_id"]
    # ── Runs ───────────────────────────────────────────────────────────
    def create_run(
        self,
        experiment_id: str,
        run_name: str,
        tags: dict[str, str] | None = None,
    ) -> str:
        body: dict[str, Any] = {
            "experiment_id": experiment_id,
            "start_time": int(time.time() * 1000),
            "run_name": run_name,
            "tags": [
                {"key": k, "value": str(v)}
                for k, v in (tags or {}).items()
            ],
        }
        r = self._post("/runs/create", body)
        return r["run"]["info"]["run_id"]
    def log_param(self, run_id: str, key: str, value: Any) -> None:
        self._post("/runs/log-parameter", {"run_id": run_id, "key": key, "value": str(value)})
    def log_params(self, run_id: str, params: dict[str, Any]) -> None:
        for k, v in params.items():
            self.log_param(run_id, k, v)
    def log_metric(self, run_id: str, key: str, value: float, step: int = 0) -> None:
        self._post("/runs/log-metric", {
            "run_id": run_id,
            "key": key,
            "value": float(value),
            "timestamp": int(time.time() * 1000),
            "step": step,
        })
    def log_metrics(self, run_id: str, metrics: dict[str, float]) -> None:
        for k, v in metrics.items():
            self.log_metric(run_id, k, v)
    def set_tag(self, run_id: str, key: str, value: str) -> None:
        self._post("/runs/set-tag", {"run_id": run_id, "key": key, "value": str(value)})
    def set_tags(self, run_id: str, tags: dict[str, str]) -> None:
        for k, v in tags.items():
            self.set_tag(run_id, k, v)
    # MLflow tag values are capped at 5000 chars by the server (RESOURCE_DOES_NOT_EXIST
    # below that, INVALID_PARAMETER_VALUE above). 4500 leaves headroom for
    # internal metadata MLflow may append on its own.
    _TAG_VALUE_LIMIT = 4500
    def log_text(self, run_id: str, text: str, artifact_path: str) -> None:
        """Persist short text alongside the run.
        The MLflow server in this deployment uses a ``file://`` artifact
        backend, which is only reachable from inside the container — not
        via the REST proxy. We instead stash short payloads as tags
        keyed ``artifact:<path>``. Anything longer than 4500 chars is
        chunked into ``artifact:<path>:0``, ``:1`` …; ``get_artifact_text``
        re-stitches them in order.
        """
        key_base = f"artifact:{artifact_path}"
        if len(text) <= self._TAG_VALUE_LIMIT:
            self.set_tag(run_id, key_base, text)
            return
        # chunk
        for i in range(0, len(text), self._TAG_VALUE_LIMIT):
            self.set_tag(run_id, f"{key_base}:{i // self._TAG_VALUE_LIMIT}",
                          text[i:i + self._TAG_VALUE_LIMIT])
    def get_artifact_text(self, run_id: str, artifact_path: str) -> str:
        run = self._get("/runs/get", {"run_id": run_id})["run"]
        tags = {t["key"]: t["value"] for t in run["data"].get("tags", [])}
        key_base = f"artifact:{artifact_path}"
        if key_base in tags:
            return tags[key_base]
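        # chunked form: keep only numeric suffixes so a sibling artifact
        # whose path merely shares this prefix cannot break the int() sort
        prefix = f"{key_base}:"
        chunks = sorted(
            (k for k in tags if k.startswith(prefix) and k[len(prefix):].isdecimal()),
            key=lambda k: int(k[len(prefix):]),
        )
        return "".join(tags[k] for k in chunks)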
    def end_run(self, run_id: str, status: str = "FINISHED") -> None:
        self._post("/runs/update", {
            "run_id": run_id,
            "status": status,
            "end_time": int(time.time() * 1000),
        })
    def search_runs(
        self,
        experiment_id: str,
        filter_string: str = "",
        max_results: int = 1000,
    ) -> list[dict]:
        body = {
            "experiment_ids": [experiment_id],
            "filter": filter_string,
            "max_results": max_results,
        }
        r = self._post("/runs/search", body)
        return r.get("runs", [])
--- a/ml/experiments/bench/rubric.md
+++ b/ml/experiments/bench/rubric.md
@@ -0,0 +1,85 @@
 # Tip-quality rubric — `tip-v1`
 This file is the consistency anchor for the Claude Code judge. The same
 rubric is used across every judging session so verdicts are comparable
 across runs (per the lazy-judge pattern in #95).
 Each candidate tip is scored on three independent 1–5 dimensions, plus
 two binary flags. Score the **content of the tip itself** for the given
 persona/context — do not score the rationale.
 ## Dimensions
 ### relevance — 1 to 5
 How well does the tip respond to *this specific persona at this specific
 time*? A generic productivity platitude is 1; a tip that hooks into the
 persona's stated preferences and the actual hour-of-day is 5.
 | score | description |
 |-------|-------------|
 | 1 | Boilerplate. Could apply to any user, any time. |
 | 2 | Vaguely fits the persona but ignores context. |
 | 3 | Fits the persona OR the time, not both. |
 | 4 | Fits both persona and time, with one specific anchor (a task, an hour, a habit). |
 | 5 | Specific to the persona's preferences AND respects the hour, with a clear hook into a candidate task or routine. |
 ### actionability — 1 to 5
 Could the user *do this in the next 10 minutes* without further planning?
 "Try to focus more" is 1; "Spend 12 minutes on the Call dentist task and
 stop when the timer ends" is 5.
 | score | description |
 |-------|-------------|
 | 1 | Pure encouragement, no action. |
 | 2 | Action exists but vague ("review your tasks"). |
 | 3 | Concrete verb + object, but missing the time/duration handle. |
 | 4 | Concrete action with a duration or trigger ("for 10 minutes", "before lunch"). |
 | 5 | Micro-action with explicit start, duration, and a stop condition. |
 ### tone — 1 to 5
 Does the tip sound like a calm, specific mentor (the product voice) or
 like a generic chatbot/coach? Penalize emoji-spam, exclamation marks,
 hype words ("amazing!", "let's crush it!"), and corporate jargon.
 | score | description |
 |-------|-------------|
 | 1 | Hype, jargon, or motivational-poster tone. |
 | 2 | Polite chatbot tone, no warmth. |
 | 3 | Neutral, businesslike. |
 | 4 | Quiet and specific, like a coach who knows you. |
 | 5 | Earned. Reads like a mentor who has seen this exact stuck-pattern before. |
 ## Binary flags
 ### format_ok — 0 or 1
 1 if the *whole response* parsed as a JSON array of objects with the
 required keys (`id`, `content`, `rationale`). 0 otherwise. **This is
 computed automatically by `collect.py`** — judges should not override it.
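
The actual check in `collect.py` is not reproduced in this rubric. A minimal sketch of what such a check might look like, assuming the raw model response is available as a string (the function name `format_ok_flag` and the decision to reject an empty array are assumptions, not the real implementation):

```python
import json

REQUIRED_KEYS = {"id", "content", "rationale"}  # keys named in this rubric


def format_ok_flag(raw_response: str) -> int:
    """Return 1 if the whole response is a JSON array of objects with the required keys."""
    try:
        parsed = json.loads(raw_response)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0
    # Assumption: an empty array counts as a format failure.
    if not isinstance(parsed, list) or not parsed:
        return 0
    for item in parsed:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            return 0
    return 1
```

Because the whole response must parse, a valid array wrapped in extra prose or a Markdown fence would score 0 under this sketch.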
 ### overlong — 0 or 1
 1 if `content` exceeds the documented 2-sentence cap (count sentence-
 ending punctuation `. ! ?`). Judges may flag this as a tiebreaker.
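
One way to approximate the count, as a hedged sketch: the function name `overlong_flag` is hypothetical, and collapsing a run such as `...` or `?!` into a single sentence end is an assumption beyond what the rubric states.

```python
import re

SENTENCE_CAP = 2  # the 2-sentence cap stated above


def overlong_flag(content: str) -> int:
    """Return 1 if content appears to contain more than SENTENCE_CAP sentences."""
    # Each run of ., ! or ? is treated as one sentence end (assumption).
    endings = re.findall(r"[.!?]+", content)
    return 1 if len(endings) > SENTENCE_CAP else 0
```

This heuristic overcounts on decimals and abbreviations ("12.5 minutes", "e.g."), which is one reason a judge may still need to make the final call.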
 ## Composite score
 `compare.py` ranks cells by:
 ```
 composite = relevance + actionability + tone + 2*format_ok - overlong
 ```
 That is, format compliance carries double weight: malformed JSON is a
 hard production failure regardless of how good the prose is.
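
A worked sketch of the formula in Python; the function name `composite_score` is illustrative and not taken from `compare.py`:

```python
def composite_score(relevance, actionability, tone, format_ok, overlong):
    """Combine the three 1-5 dimensions and the two 0/1 flags per the formula above."""
    return relevance + actionability + tone + 2 * format_ok - overlong


# Best possible cell: all dimensions at 5, valid JSON, within the length cap.
best = composite_score(5, 5, 5, format_ok=1, overlong=0)
# Worst possible cell: all dimensions at 1, malformed JSON, overlong.
worst = composite_score(1, 1, 1, format_ok=0, overlong=1)
print(best, worst)  # 17 2
```

Under this formula a single judged cell scores between 2 and 17.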
 ## Calibration examples
 (Shared with judges so a 4 means the same thing across sessions.)
 **Persona**: deadline-driven (responds to overdue/high-priority,
 morning-active). **Hour**: 09:00. **Tasks include**: an overdue
 "Call dentist", priority 4.
 - "Stay focused and make today count!" — relevance 1, actionability 1, tone 1.
 - "Review your tasks and pick one that matters." — relevance 2, actionability 2, tone 3.
 - "Spend the next 12 minutes on Call dentist — set a timer and stop when it rings." — relevance 5, actionability 5, tone 4.
 - "It's 09:00 — you respond to overdue items best now. Block 12 minutes for Call dentist before your first meeting." — relevance 5, actionability 5, tone 5.
--- a/ml/experiments/bench/scenarios.py
+++ b/ml/experiments/bench/scenarios.py
@@ -0,0 +1,80 @@
 """Fixed contexts for the tip-generation benchmark.
 Every cell of the (model × prompt) grid is evaluated on the *same* set of
 scenarios so quality differences are attributable to the model/prompt,
 not to context variance.
 A scenario is one (persona, hour-of-day, candidate-task-pool) tuple. The
 hours come from a fixed list of time-slots, and each task pool is seeded
 deterministically from the persona's name and the hour so the bench is
 reproducible across machines.
 """
 from __future__ import annotations
 import sys
 from dataclasses import dataclass
 from pathlib import Path
 # Reuse personas from sim — same source of truth for user archetypes.
 sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "sim"))
 from personas import PERSONAS, Persona  # type: ignore
 from task_generator import generate_task_pool  # type: ignore
@dataclass(frozen=True)
 class Scenario:
    id: str           # stable id used as MLflow tag — keep ASCII safe
    persona: Persona
    hour_of_day: int  # 0–23
    day_of_week: int  # 0=Mon
    tasks: list[dict]
    def to_prompt_context(self) -> dict:
        """Shape expected by ml/serving/prompts.PromptContext."""
        return {
            "tasks": [
                {
                    "content": t["content"],
                    "priority": t["features"]["priority"],
                    "is_overdue": t["features"]["is_overdue"],
                    "due_date": t.get("due_date", "no due date"),
                }
                for t in self.tasks
            ],
            "hour_of_day": self.hour_of_day,
            "day_of_week": self.day_of_week,
            "extra": {
                "persona": self.persona.name,
                "persona_hint": self.persona.description,
            },
        }
 # Two time-slots probe whether the model adapts its tone to the hour.
 # Morning (09) and evening (21) are picked because most personas have
 # strong directional preferences there.
 _TIME_SLOTS = [(9, 1), (21, 3)]   # (hour_of_day, day_of_week)
 def build_scenarios(tasks_per_scenario: int = 6) -> list[Scenario]:
    """Return a deterministic list of scenarios.
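    With 4 personas × 2 time-slots = 8 scenarios. Task pools are seeded
    by ``crc32(persona.name) + hour`` so runs are reproducible and each
    persona sees the same tasks at the same hour across cells.
    """
    # Built-in ``hash()`` is salted per process for str (PYTHONHASHSEED),
    # so it would yield a different seed on every run; CRC32 is stable.
    import zlib
    out: list[Scenario] = []
    for persona in PERSONAS[:4]:
        for hour, dow in _TIME_SLOTS:
            seed = (zlib.crc32(persona.name.encode("utf-8")) % 9973) + hour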
            tasks = generate_task_pool(n=tasks_per_scenario, seed=seed)
            out.append(
                Scenario(
                    id=f"{persona.name}-h{hour:02d}",
                    persona=persona,
                    hour_of_day=hour,
                    day_of_week=dow,
                    tasks=tasks,
                )
            )
    return out