
alvis f8d66aa01f chore: remove Airflow completely from the stack

Drop all four Airflow containers (db, init, webserver, scheduler) from the
mlops compose profile, leaving MLflow as the sole mlops service. Remove
AIRFLOW_* env vars, config fields, health-check entries, DAG trigger code
in admin/bench routes, the airflow_dag_run_id schema column, Airflow nav
links and DAG-run links in the admin UI, the two Airflow DAG files
(bench_dag.py, sim_dag.py), and all related docs/ADR references.
Simulations now run exclusively via the subprocess path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-03 16:38:46 +00:00


`bench/` — combined model + prompt evaluation harness

Combines the work of issues #93 (model benchmark) and #95 (prompt A/B) into one MLflow-tracked experiment. Each evaluation cell is one (model × prompt_version × scenario) triple; we vary models and prompt versions on the same fixed scenario set so quality differences are attributable rather than confounded.
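As a rough illustration of the cell grid described above, the sketch below enumerates every triple with `itertools.product`. The `Cell` type, the `build_grid` helper, and the scenario ids are hypothetical names chosen for this example and are not taken from the harness itself.

```python
from itertools import product
from typing import NamedTuple


class Cell(NamedTuple):
    """One evaluation cell (hypothetical type, for illustration only)."""
    model: str
    prompt_version: str
    scenario_id: str


def build_grid(models, prompt_versions, scenario_ids):
    """Return every (model, prompt_version, scenario) combination exactly once."""
    return [Cell(m, p, s) for m, p, s in product(models, prompt_versions, scenario_ids)]


# Scenario ids are placeholders; the real ones would come from scenarios.py.
grid = build_grid(
    ["qwen2.5:0.5b", "gemma3:1b"],
    ["v1", "v2-mentor"],
    ["s01", "s02", "s03"],
)
print(len(grid))  # 2 models * 2 prompts * 3 scenarios = 12
```

Because the scenario list is shared by every (model, prompt) pair, any two pairs can be compared on identical inputs.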

Pieces

| File | Purpose |
| --- | --- |
| `rubric.md` | The scoring rubric (`tip-v1`). Anchor for the human judge across sessions. |
| `scenarios.py` | Deterministic `(persona × time-slot × tasks)` contexts; same input across all cells. |
| `mlflow_client.py` | Thin httpx-based MLflow REST wrapper. Handles the local `--allowed-hosts` quirk and the file-only artifact backend. |
| `collect.py` | Phase A. Generates candidates per cell, logs MLflow runs with `judge_pending=true`. |
| `judge_cli.py` | Phase B. `--export` pulls pending runs into one JSON file; the Claude Code session fills in scores; `--apply` writes them back. |
| `compare.py` | Phase C. Leaderboard per `(model, prompt)` cell. |
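One way to get deterministic `(persona × time-slot × tasks)` contexts is to draw tasks from a locally seeded random generator. The sketch below assumes that approach; the persona, time-slot, and task values, and the `build_scenarios` helper, are invented for illustration and may not match what `scenarios.py` does.

```python
import random
from itertools import product

# All values below are hypothetical placeholders.
PERSONAS = ["student", "freelancer"]
TIME_SLOTS = ["morning", "evening"]
TASK_POOL = ["write report", "gym", "email", "study", "groceries"]


def build_scenarios(seed=0, tasks_per_scenario=3):
    """Build one context per (persona, time slot) with a reproducible task sample."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    scenarios = []
    for persona, slot in product(PERSONAS, TIME_SLOTS):
        tasks = rng.sample(TASK_POOL, tasks_per_scenario)
        scenarios.append({"persona": persona, "time_slot": slot, "tasks": tasks})
    return scenarios


# The same seed always yields the same scenario list.
print(build_scenarios() == build_scenarios())  # True
```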

RAM safety (#93 hard requirement)

- Models larger than 4B parameters are rejected up front by `collect.py --max-model-b 4.0`.
- Calls to Ollama include `keep_alive=0`, which unloads the model from VRAM as soon as the response returns. We never hold two sets of LLM weights concurrently.
- No mock or embedded judge holds weights either: the human judge is the Claude Code session, so judging costs no RAM.

The pipeline can run end to end on a box with 15 GiB of RAM and 8 GiB of VRAM (1070-class GPU) without paging.
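The two guards above can be sketched as follows. This is a minimal illustration, not the code in `collect.py`: it assumes the parameter count can be read from the Ollama tag suffix (for example `:3b`), and the helper names `model_size_b`, `check_models`, and `generate_payload` are hypothetical. The `keep_alive` field is a documented option of Ollama's generate request.

```python
import re


def model_size_b(tag):
    """Parse the parameter count in billions from a tag such as 'llama3.2:3b'.

    Assumption: the size is encoded after the colon; tags without it are rejected.
    """
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    if match is None:
        raise ValueError(f"cannot infer model size from tag {tag!r}")
    return float(match.group(1))


def check_models(tags, max_model_b=4.0):
    """Fail before any model is loaded if one exceeds the size limit."""
    too_big = [t for t in tags if model_size_b(t) > max_model_b]
    if too_big:
        raise ValueError(f"models exceed {max_model_b}B: {too_big}")
    return list(tags)


def generate_payload(model, prompt):
    """Body for an Ollama generate request; keep_alive=0 unloads the model after the reply."""
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": 0}


check_models(["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"])
print(model_size_b("qwen2.5:0.5b"))  # 0.5
```

Checking sizes before the first request means an oversized model is never pulled into memory at all.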

Quick start

# 1. Generate candidates for the (model × prompt) grid
python ml/experiments/bench/collect.py \
    --models qwen2.5:0.5b,qwen2.5:1.5b,gemma3:1b,llama3.2:3b \
    --prompts v1,v2-mentor,v3-few-shot \
    --experiment tip-bench-2026-04-27 \
    --n-tips 5 \
    --diversity

# 2. Export pending runs for Claude Code to score
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --export /tmp/oo-bench-judge.json

# 3. (Claude Code edits /tmp/oo-bench-judge.json, fills scores per rubric.md.)

# 4. Push scores back to MLflow
python ml/experiments/bench/judge_cli.py \
    --experiment tip-bench-2026-04-27 \
    --apply /tmp/oo-bench-judge.json

# 5. Leaderboard
python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27
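
Between steps 3 and 4 it can help to confirm that every exported run has actually been scored. The sketch below assumes a file layout (a list of entries with `run_id` and a `scores` object) and a 1 to 5 scale; both are guesses for illustration, since the real export format and scale are defined by `judge_cli.py` and `rubric.md`.

```python
import json

SCORE_KEYS = ("relevance", "actionability", "tone")


def unscored_runs(entries, lo=1, hi=5):
    """Return run ids whose scores are missing or outside [lo, hi] (assumed scale)."""
    bad = []
    for entry in entries:
        scores = entry.get("scores") or {}
        ok = all(
            isinstance(scores.get(k), (int, float)) and lo <= scores[k] <= hi
            for k in SCORE_KEYS
        )
        if not ok:
            bad.append(entry["run_id"])
    return bad


# Hypothetical export content; run ids and values are made up.
exported = json.loads("""
[
  {"run_id": "run-a", "scores": {"relevance": 4, "actionability": 3, "tone": 5}},
  {"run_id": "run-b", "scores": {"relevance": 4, "actionability": null, "tone": 5}}
]
""")
print(unscored_runs(exported))  # ['run-b']
```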

Why the rubric matters

Different judging sessions need to be comparable. rubric.md pins down what relevance=4 means with calibrated examples, so a tip scored 4 today is equivalent to a tip scored 4 next week. Without the rubric, the "lazy human-in-the-loop" judge drifts.

Accessing results in MLflow

Each run's quality scores (relevance, actionability, tone, composite) are stored as metrics on the MLflow run and are accessible via:

- MLflow UI: experiment `tip-bench-2026-04-27` → click any run → Metrics section
- Leaderboard: `python ml/experiments/bench/compare.py --experiment tip-bench-2026-04-27`
- Raw API: `mlflow_client.search_runs()` filters and pulls metrics in bulk
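
Once runs have been pulled in bulk, a per-cell leaderboard is a small aggregation. The sketch below averages the composite metric per (model, prompt) pair; the flat run dictionaries and the use of a plain mean are assumptions for illustration, and `compare.py` may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Average the composite metric per (model, prompt) cell, best first."""
    by_cell = defaultdict(list)
    for run in runs:
        by_cell[(run["model"], run["prompt"])].append(run["composite"])
    rows = [
        (model, prompt, mean(values), len(values))
        for (model, prompt), values in by_cell.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up numbers, for illustration only.
runs = [
    {"model": "gemma3:1b", "prompt": "v1", "composite": 3.0},
    {"model": "gemma3:1b", "prompt": "v1", "composite": 4.0},
    {"model": "llama3.2:3b", "prompt": "v2-mentor", "composite": 4.5},
]
for model, prompt, score, n in leaderboard(runs):
    print(f"{model:14s} {prompt:10s} {score:.2f} (n={n})")
```

Reporting the run count next to each mean makes it visible when a cell's score rests on very few judged runs.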

Candidate tips, prompts, and raw responses are stored as run tags with the keys `artifact:candidates.json`, `artifact:prompt.txt`, and `artifact:raw.txt`. Tags are used as a fallback because the MLflow server uses a `file://` artifact backend that is not accessible via REST from the host.
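
Reading the tag fallback back out can look like the sketch below. It relies on the MLflow REST convention that a run carries its tags as a list of key/value objects under `data.tags`; the `load_candidates` helper and the shape of the candidates JSON (a list of strings) are assumptions for illustration.

```python
import json


def tags_to_dict(tags):
    """Flatten MLflow's list of {"key": ..., "value": ...} tag objects into a dict."""
    return {tag["key"]: tag["value"] for tag in tags}


def load_candidates(run):
    """Decode the candidates stored under the artifact:candidates.json tag, if any."""
    tags = tags_to_dict(run["data"].get("tags", []))
    raw = tags.get("artifact:candidates.json")
    return json.loads(raw) if raw is not None else None


# Hypothetical run payload with made-up content.
run = {
    "info": {"run_id": "run-a"},
    "data": {
        "tags": [
            {"key": "judge_pending", "value": "true"},
            {"key": "artifact:candidates.json", "value": "[\"tip one\", \"tip two\"]"},
        ]
    },
}
print(load_candidates(run))  # ['tip one', 'tip two']
```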

Running standalone

The pipeline runs on any machine with:

- Ollama serving models of 4B parameters or fewer
- An MLflow tracking server
- Python 3.10+
