research: model benchmark for tip generation — qwen2.5 vs llama3.2 vs gemma3 #93

Closed
opened 2026-04-17 08:12:57 +00:00 by alvis · 3 comments
Owner

Goal

Pick the default tip-generator model before shipping AI tips to users.

COMPLETED

Built ml/experiments/bench/ — a production-ready MLflow-based benchmark harness that evaluates 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) on tip-generation quality.

Key results

  • 48 cells evaluated: 4 models × 3 prompts × 4 scenarios
  • Winner: qwen2.5:1.5b (composite score 12.75)
  • RAM safe: models ≤4B enforced, keep_alive=0 on all Ollama calls
  • Quality dimensions: relevance, actionability, tone (1–5 scales per rubric)
  • Format compliance: 100% (all candidates parsed as valid JSON)
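To make the cell count and scoring concrete, here is a minimal sketch of how the 48-cell grid could be enumerated and how a single cell might be checked and scored. The model tags come from this issue; the prompt and scenario names are placeholders. The composite formula (sum of the three 1–5 rubric scores, so a maximum of 15) is an assumption, since the issue does not define how the composite score is computed.

```python
import itertools
import json

# Model tags are taken from the issue text.
MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
# Hypothetical names: the issue only says there are 3 prompts and 4 scenarios.
PROMPTS = ["prompt_a", "prompt_b", "prompt_c"]
SCENARIOS = ["scenario_1", "scenario_2", "scenario_3", "scenario_4"]
DIMENSIONS = ("relevance", "actionability", "tone")


def build_cells():
    """Enumerate every (model, prompt, scenario) combination."""
    return list(itertools.product(MODELS, PROMPTS, SCENARIOS))


def is_valid_json(text):
    """Format-compliance check: does the raw model output parse as JSON?"""
    try:
        json.loads(text)
    except (TypeError, ValueError):
        return False
    return True


def composite(scores):
    """ASSUMPTION: composite = sum of the three 1-5 rubric scores."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be in [1, 5], got {scores[dim]}")
    return sum(scores[dim] for dim in DIMENSIONS)


cells = build_cells()
print(len(cells))  # 4 models x 3 prompts x 4 scenarios = 48
```

Under this assumed formula, a per-model composite would be the mean of `composite(...)` over that model's 12 cells.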

Deliverables

  • ml/experiments/bench/ with rubric, scenarios, MLflow client, collect/judge/compare scripts
  • ml/pipelines/bench_dag.py — Airflow DAG for orchestration
  • services/api/src/routes/bench.ts — admin API endpoints
  • MLflow experiment tip-bench-2026-04-27 with full results

See commits 556019b and 0474ad4 for full implementation.

alvis added this to the M2 — AI tips + multi-source signals milestone 2026-04-17 08:12:57 +00:00
Author
Owner

Don’t use models larger than 4B locally because of RAM limits.
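One way to enforce this limit, together with the keep_alive=0 setting mentioned in the issue body, is sketched below. The helper names (`param_billions`, `check_model`, `generate_payload`) are hypothetical and not taken from the repository. The payload fields follow Ollama's `/api/generate` request body; requesting `"format": "json"` is an assumption about how the harness obtains JSON output.

```python
import re

MAX_PARAMS_B = 4.0  # limit stated in the comment above


def param_billions(tag):
    """Parse the size suffix of an Ollama-style tag, e.g. 'qwen2.5:1.5b' -> 1.5."""
    match = re.search(r":(\d+(?:\.\d+)?)b$", tag.lower())
    if match is None:
        raise ValueError(f"no parameter size found in tag: {tag!r}")
    return float(match.group(1))


def check_model(tag):
    """Return the tag unchanged, or raise if the model exceeds the RAM-safe limit."""
    size = param_billions(tag)
    if size > MAX_PARAMS_B:
        raise ValueError(f"{tag} has {size}B parameters, limit is {MAX_PARAMS_B}B")
    return tag


def generate_payload(tag, prompt):
    """Build a request body for Ollama's /api/generate (nothing is sent here)."""
    return {
        "model": check_model(tag),
        "prompt": prompt,
        "format": "json",   # ASSUMPTION: ask Ollama for JSON output
        "stream": False,
        "keep_alive": 0,     # unload the model right after the response
    }


payload = generate_payload("qwen2.5:1.5b", "Suggest one tip.")
```

Checking the tag before building the payload means an oversized model is rejected before any request is made.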

Author
Owner

Don’t use Claude Haiku. Evaluate the models lazily, with Claude Code as the active judge: first collect what needs to be evaluated, then the Claude Code user runs Claude Code to do the evaluation, and then the results are recorded. All of this must go through MLflow, using the MLflow judge system.
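The collect, judge, record sequence described here can be illustrated with a small stand-in. This is only a sketch of the deferred-evaluation idea: it uses a local JSONL file as the queue, whereas the comment requires the real harness to store and judge through MLflow. The function names and the `cell_id` field are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def _read(path):
    text = Path(path).read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


def collect(path, candidates):
    """Phase 1: queue generated tips with no scores attached yet."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for cand in candidates:
            fh.write(json.dumps({**cand, "scores": None}) + "\n")


def pending(path):
    """Rows still waiting for the judge (phase 2 happens outside this code)."""
    return [row for row in _read(path) if row["scores"] is None]


def record(path, cell_id, scores):
    """Phase 3: store the scores the judge produced for one cell."""
    rows = _read(path)
    matched = False
    for row in rows:
        if row["cell_id"] == cell_id:
            row["scores"] = scores
            matched = True
    if not matched:
        raise KeyError(cell_id)
    Path(path).write_text(
        "".join(json.dumps(row) + "\n" for row in rows), encoding="utf-8"
    )


# Usage: collect one candidate, judge it later, then record the result.
queue = Path(tempfile.mkdtemp()) / "pending.jsonl"
collect(queue, [{"cell_id": "c1", "model": "qwen2.5:1.5b", "tip": "..."}])
before = pending(queue)
record(queue, "c1", {"relevance": 4, "actionability": 4, "tone": 5})
after = pending(queue)
```

Keeping the judging step outside the code is the point of the design: generation can run unattended, and scoring happens whenever a person starts a judging session.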

Author
Owner

Check the comments on #95 for an idea of how to do this.

alvis closed this issue 2026-04-27 12:01:44 +00:00
Reference: alvis/oO#93