research: model benchmark for tip generation — qwen2.5 vs llama3.2 vs gemma3 #93
Goal
Pick the default tip-generator model before shipping AI tips to users.
✅ COMPLETED
Built
ml/experiments/bench/ is a production-ready MLflow-based benchmark harness that evaluates 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) on tip-generation quality.
Key results
Deliverables
ml/experiments/bench/ with rubric, scenarios, MLflow client, collect/judge/compare scripts
ml/pipelines/bench_dag.py: Airflow DAG for orchestration
services/api/src/routes/bench.ts: admin API endpoints
tip-bench-2026-04-27 with full results
See commits 556019b and 0474ad4 for full implementation.
Don't use models larger than 4b locally because of RAM limits.
Don't use Claude Haiku. Models need to be evaluated lazily by Claude Code, in an active manner: first you collect what to evaluate, then a Claude Code user runs Claude Code for the evaluation, and then the result is recorded. ALL OF THIS GOES THROUGH MLFLOW; it must use the MLflow judge system.
Check the comments on #95 for an idea of how to do this.
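The collect, judge, record flow asked for above could be sketched roughly as follows. This is only an illustration, not the code in ml/experiments/bench/. The names `PendingItem`, `collect`, `record`, the JSON queue file, and the `generate` and `sink` callables are all hypothetical. It uses only the standard library, so the point where results would be handed to MLflow is left as an injectable `sink`. The 4b cap is applied during collection, assuming model tags end in a size suffix like `:3b`.

```python
import json
import re
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, List, Optional

MAX_PARAMS_B = 4.0  # local RAM limit from the thread: nothing larger than 4b


def params_in_billions(tag: str) -> float:
    """Read the size suffix of a tag such as 'qwen2.5:0.5b' (assumed format)."""
    match = re.search(r":(\d+(?:\.\d+)?)b$", tag)
    if match is None:
        raise ValueError(f"no size suffix in model tag: {tag!r}")
    return float(match.group(1))


@dataclass
class PendingItem:
    """One generated tip waiting for a judge score (hypothetical schema)."""
    model: str
    scenario: str
    tip: str
    score: Optional[float] = None  # stays None until the judging session fills it in


def collect(
    models: List[str],
    scenarios: List[str],
    generate: Callable[[str, str], str],
    queue_path: Path,
) -> List[PendingItem]:
    """Phase 1: generate tips and queue them. No judging happens here."""
    allowed = [m for m in models if params_in_billions(m) <= MAX_PARAMS_B]
    items = [PendingItem(m, s, generate(m, s)) for m in allowed for s in scenarios]
    queue_path.write_text(json.dumps([asdict(i) for i in items], indent=2))
    return items


# Phase 2 is manual: a Claude Code session reads the queue file and
# writes a score into each entry. Nothing in this module calls a judge.


def record(queue_path: Path, sink: Callable[[PendingItem], None]) -> int:
    """Phase 3: pass every judged item to the sink; return the unjudged count."""
    items = [PendingItem(**raw) for raw in json.loads(queue_path.read_text())]
    pending = 0
    for item in items:
        if item.score is None:
            pending += 1
        else:
            sink(item)
    return pending
```

In this sketch `record` can be run repeatedly, and a non-zero return value means some items are still waiting for the judging session. In the real harness the `sink` would presumably be the place where scores are written through MLflow, but how that maps onto the MLflow judge system is not shown here.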