research: model benchmark for tip generation — qwen2.5 vs llama3.2 vs gemma3 #93

alvis · 2026-04-17T08:12:57Z

Goal

Pick the default tip-generator model before shipping AI tips to users.

✅ COMPLETED

Built ml/experiments/bench/ — a production-ready MLflow-based benchmark harness that evaluates 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) on tip-generation quality.

Key results

48 cells evaluated: 4 models × 3 prompts × 4 scenarios
Winner: qwen2.5:1.5b (composite score 12.75)
RAM safe: models ≤4B enforced, keep_alive=0 on all Ollama calls
Quality dimensions: relevance, actionability, tone (1–5 scales per rubric)
Format compliance: 100% (all candidates parsed as valid JSON)
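
The cell count and the quality dimensions above could be sketched roughly as follows. The model tags come from this issue; the prompt and scenario names are placeholders, and summing the three rubric scores into a composite is an assumption, since the issue does not say how the 12.75 figure was aggregated.

```python
import json
from itertools import product
from statistics import mean

# Model tags are taken from the issue; prompt and scenario names are hypothetical.
MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
PROMPTS = ["prompt_a", "prompt_b", "prompt_c"]
SCENARIOS = ["scenario_1", "scenario_2", "scenario_3", "scenario_4"]
DIMENSIONS = ("relevance", "actionability", "tone")


def build_grid():
    """Return every (model, prompt, scenario) cell of the benchmark."""
    return list(product(MODELS, PROMPTS, SCENARIOS))


def composite(scores):
    """Sum the three 1-5 rubric scores (assumed aggregation, range 3-15)."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} score {scores[dim]} is outside 1-5")
    return sum(scores[dim] for dim in DIMENSIONS)


def mean_composite(cell_scores):
    """Average the composite over a model's judged cells (assumed aggregation)."""
    return mean(composite(s) for s in cell_scores)


def format_compliance(raw_outputs):
    """Fraction of raw model outputs that parse as JSON."""
    if not raw_outputs:
        return 0.0
    valid = 0
    for raw in raw_outputs:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(raw_outputs)
```

With these placeholder lists, `len(build_grid())` gives the 4 × 3 × 4 = 48 cells reported above.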

Deliverables

ml/experiments/bench/ with rubric, scenarios, MLflow client, collect/judge/compare scripts
ml/pipelines/bench_dag.py — Airflow DAG for orchestration
services/api/src/routes/bench.ts — admin API endpoints
MLflow experiment tip-bench-2026-04-27 with full results

See commits 556019b and 0474ad4 for full implementation.

alvis added this to the M2 — AI tips + multi-source signals milestone 2026-04-17 08:12:57 +00:00

alvis commented

2026-04-17 08:17:38 +00:00

Don’t use models larger than 4B parameters locally because of RAM limits.
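
One way such a guard might look, as a minimal sketch: it assumes the parameter count can be read from the tag suffix (as in `qwen2.5:1.5b`) and only builds the request body for Ollama's `/api/generate` endpoint without sending it. The function names and the `MAX_PARAMS_B` constant are hypothetical, not taken from the harness.

```python
import re

MAX_PARAMS_B = 4.0  # ceiling from the comment above, in billions of parameters


def param_size_b(model_tag):
    """Parse the parameter count (in billions) from a tag like 'qwen2.5:1.5b'."""
    match = re.search(r":(\d+(?:\.\d+)?)b$", model_tag.lower())
    if match is None:
        raise ValueError(f"cannot read parameter size from {model_tag!r}")
    return float(match.group(1))


def generate_payload(model_tag, prompt):
    """Build an Ollama /api/generate request body, refusing oversized models."""
    size = param_size_b(model_tag)
    if size > MAX_PARAMS_B:
        raise ValueError(f"{model_tag} is {size}B, above the {MAX_PARAMS_B}B limit")
    return {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,  # ask Ollama to unload the model right after responding
    }
```

Rejecting the model before any request is made keeps an oversized tag from ever being loaded into RAM.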

alvis referenced this issue

2026-04-26 14:25:59 +00:00

feat: automated prompt optimization loop — sim A/B → promote winner #95

alvis commented

2026-04-27 05:17:51 +00:00

Don’t use Claude Haiku. Evaluate the models lazily, with Claude Code as the active judge: first collect what needs to be evaluated, then the user runs Claude Code to do the evaluation, and then the results are recorded. All of this must go through MLflow; use the MLflow judge system.
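
The collect, judge, record order described here could be modelled roughly like this. It is a standard-library sketch of the state tracking only: the class and method names are hypothetical, and the MLflow logging and judge calls that the comment requires are deliberately left out rather than guessed at.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Candidate:
    cell_id: str
    output: str
    scores: Optional[dict] = None  # stays None until a judging pass records scores


@dataclass
class LazyBench:
    candidates: dict = field(default_factory=dict)

    def collect(self, cell_id, output):
        """Phase 1: store a model output without scoring it."""
        self.candidates[cell_id] = Candidate(cell_id, output)

    def pending(self):
        """Phase 2 input: the cells still waiting for a judge."""
        return [c for c in self.candidates.values() if c.scores is None]

    def record(self, cell_id, scores):
        """Phase 3: attach the judge's scores to a collected candidate."""
        if cell_id not in self.candidates:
            raise KeyError(f"{cell_id} was never collected")
        self.candidates[cell_id].scores = dict(scores)
```

Nothing is scored at collection time; a later judging session works through `pending()` and calls `record()` for each cell it has evaluated.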