alvis/oO

alvis 556019b060 feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)

Combines model evaluation (#93) and prompt A/B testing (#95) into one
experiment. Evaluates all (model × prompt × scenario) cells on the same
fixed contexts so quality differences are attributable.

Architecture:
- Phase A (collect.py): generates candidates per cell and logs them to
  MLflow with judge_pending=true. Rejects models larger than 4B and uses
  keep_alive=0 for memory safety (no concurrent model weights in VRAM).
- Phase B (judge_cli.py): exports pending runs as JSON for Claude Code
  to score per the rubric, then applies scores back to MLflow.
- Phase C (compare.py): leaderboard by (model, prompt) cell.
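
Phase C's ranking can be pictured as a group-by over judged runs. The sketch below is not the code in compare.py. It assumes each judged run has been reduced to a `(model, prompt, composite)` tuple and that cells are ranked by mean composite across scenarios; the function name and the sample scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Rank (model, prompt) cells by mean composite score, best first.

    `runs` is an iterable of (model, prompt, composite) tuples, one per
    judged scenario. This shape is an assumption, not the MLflow schema.
    """
    by_cell = defaultdict(list)
    for model, prompt, composite in runs:
        by_cell[(model, prompt)].append(composite)
    rows = [(cell, mean(scores)) for cell, scores in by_cell.items()]
    # Highest mean first; ties fall back to the cell name for stable output.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Made-up scores, only to show the call shape.
sample_runs = [
    ("model-a", "v1", 10), ("model-a", "v1", 12),
    ("model-b", "v1", 9), ("model-b", "v1", 9),
]
ranked = leaderboard(sample_runs)
```

Because every cell is judged on the same fixed scenarios, averaging per cell keeps the comparison between cells like-for-like.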

Rubric (tip-v1) defines 1–5 scales for relevance, actionability, tone,
plus format_ok and overlong flags. Composite = rel + act + tone +
2×format_ok − overlong. Rubric is self-describing and persisted in every
run so judges use consistent criteria across sessions.
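
As a worked illustration of the composite formula above, here is a minimal sketch. It assumes the three scales are integers in 1..5 and that format_ok and overlong are boolean flags counted as 1 or 0; the function name and the range check are illustrative, not taken from the harness.

```python
def composite(relevance, actionability, tone, format_ok, overlong):
    """Composite = rel + act + tone + 2*format_ok - overlong (rubric tip-v1)."""
    for name, value in (("relevance", relevance),
                        ("actionability", actionability),
                        ("tone", tone)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    # Flags are assumed to be booleans; int() maps True/False to 1/0.
    return relevance + actionability + tone + 2 * int(format_ok) - int(overlong)


best = composite(5, 5, 5, format_ok=True, overlong=False)   # 17
worst = composite(1, 1, 1, format_ok=False, overlong=True)  # 2
```

Under these assumptions a single run scores between 2 and 17, so a non-integer figure such as 12.75 would be an average over several judged runs.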

Artifacts (prompts, candidates, raw responses) stored as MLflow tags
because the server uses a file:// backend not accessible via REST. Full
artifacts accessible in MLflow UI → run → Tags section.

Tested end-to-end on local machine:
- 4 models (qwen2.5:0.5b/1.5b, gemma3:1b, llama3.2:3b) ≤4B
- 3 prompts (v1, v2-mentor, v3-few-shot)
- 4 scenarios (4 personas × 2 time-slots)
- 48 cells total, all judged and ranked
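
The cell count follows from a plain Cartesian product. In this sketch the model and prompt names come from the list above, while the scenario identifiers are placeholders, since the commit does not name them.

```python
from itertools import product

MODELS = ["qwen2.5:0.5b", "qwen2.5:1.5b", "gemma3:1b", "llama3.2:3b"]
PROMPTS = ["v1", "v2-mentor", "v3-few-shot"]
# Placeholder identifiers; the real scenario names are not given here.
SCENARIOS = ["scenario-1", "scenario-2", "scenario-3", "scenario-4"]

# One generation run per (model, prompt, scenario) cell.
cells = list(product(MODELS, PROMPTS, SCENARIOS))
# The leaderboard groups those runs by (model, prompt).
leaderboard_cells = list(product(MODELS, PROMPTS))
```

Here `len(cells)` is 48 and `len(leaderboard_cells)` is 12, so each leaderboard row aggregates four scenario runs.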

Winner: qwen2.5:1.5b × v3-few-shot (composite=12.75).

Ready for integration into Airflow prompt_ab_eval DAG and admin UI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-27 11:48:59 +00:00

Name	Last commit	Date
experiments	feat(bench): MLflow-based tip-generation benchmark harness (#93, #95)	2026-04-27 11:48:59 +00:00
features	feat(features): per-feature freshness spec — JIT vs batched (#61)	2026-04-25 17:02:55 +00:00
notebooks	chore: scaffold oO monorepo with architecture, roadmap, and module stubs	2026-04-13 14:19:56 +00:00
pipelines	feat(simulate): MLflow tracking, Airflow DAG integration, health checks for mlflow/airflow	2026-04-26 12:08:36 +00:00
registry	chore: scaffold oO monorepo with architecture, roadmap, and module stubs	2026-04-13 14:19:56 +00:00
serving	docs(observability): add services/api README; update ml/serving + recommender docs (#18)	2026-04-26 03:41:39 +00:00
README.md	feat(features): per-feature freshness spec — JIT vs batched (#61)	2026-04-25 17:02:55 +00:00

README.md

ml/

Python. Owns models, features, training, online scoring.

Dir	Role	Phase
`serving/`	FastAPI online scorer (`/score`, `/generate`) + LiteLLM gateway + prompt registry (`prompts.py`) + JetStream consumers for `signals.>` / `feedback.>`, called by `recommender`	1–2
`features/`	context assembler (`context.py`): signals → `PromptContext`; profile-feature schema mirror (`profile_schema.py`); Feast adapter later	2
`pipelines/`	batch feature + training DAGs (Prefect/Airflow)	4
`registry/`	MLflow-backed model registry integration	4
`experiments/`	A/B assignment + multi-armed bandit policies	4
`notebooks/`	research; never imported by production code	—

Principles

- Every model has a model card in `registry/` describing inputs, offline metrics, fairness checks, and rollout history.
- Online inference must be stateless and < 50 ms p99.
- Training reads from the offline feature store; serving reads from the online feature store; definitions are shared (no train/serve skew).
- Shadow deploys before any policy change that affects real users.

Feature contract

Profile features (batched)

User-level features (completion rate, preferred hour, tip volume…) are computed by the TypeScript recommender and shipped to `ml/serving` on every `/score` and `/generate` call as `profile_features: dict | None`. The Python mirror in `features/profile_schema.py` documents each feature's name, dtype, TTL, source, and null fallback. Keep it in sync with `services/api/src/profile/registry.ts`; a CI-style test asserts that names and `ttlSec` values match. See ADR-0011.
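
A rough sketch of what such a mirror and its sync check could look like. Everything here is hypothetical: the `FeatureSpec` fields follow the attributes listed above, but the entries, TTL values, and the `find_drift` helper are invented for illustration, and the step that extracts names and `ttlSec` from `registry.ts` is not shown.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    dtype: str
    ttl_sec: int
    source: str
    null_fallback: Any


# Illustrative entries; the real names and TTLs live in profile_schema.py.
PROFILE_FEATURES = [
    FeatureSpec("completion_rate", "float", 3600, "recommender", 0.0),
    FeatureSpec("preferred_hour", "int", 86400, "recommender", None),
]


def find_drift(python_specs, ts_registry):
    """Return human-readable mismatches between the two registries.

    `ts_registry` maps feature name -> ttlSec, as it might be extracted
    from registry.ts.
    """
    problems = []
    py = {spec.name: spec.ttl_sec for spec in python_specs}
    for name in sorted(py.keys() - ts_registry.keys()):
        problems.append(f"{name}: missing from TypeScript registry")
    for name in sorted(ts_registry.keys() - py.keys()):
        problems.append(f"{name}: missing from Python mirror")
    for name in sorted(py.keys() & ts_registry.keys()):
        if py[name] != ts_registry[name]:
            problems.append(f"{name}: ttl {py[name]} != ttlSec {ts_registry[name]}")
    return problems
```

A test would then assert that `find_drift(...)` returns an empty list, which reports every mismatch at once instead of failing on the first one.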

Context features (JIT)

Request-time signals assembled by `features/context.py` (`hour_of_day`, `day_of_week`, task list). These are never cached: they are derived from the system clock and the live Todoist feed at the moment of the score call. `CONTEXT_FEATURES` in `context.py` declares freshness, source, and fallback for each field (issue #61).
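
To make the JIT idea concrete, here is a minimal sketch of a per-field declaration and a request-time assembler. The dictionary shape, the fallback values, the `assemble_context` name, and the choice of UTC are all assumptions for illustration; the actual definitions are in `context.py`.

```python
from datetime import datetime, timezone

# Hypothetical per-field declaration: freshness, source, fallback.
CONTEXT_FEATURES = {
    "hour_of_day": {"freshness": "jit", "source": "system clock", "fallback": 12},
    "day_of_week": {"freshness": "jit", "source": "system clock", "fallback": 0},
    "tasks": {"freshness": "jit", "source": "Todoist feed", "fallback": []},
}


def assemble_context(now=None, tasks=None):
    """Build request-time context; nothing is cached between calls."""
    if now is None:
        now = datetime.now(timezone.utc)  # UTC is an assumption
    if tasks is None:
        tasks = CONTEXT_FEATURES["tasks"]["fallback"]
    return {
        "hour_of_day": now.hour,
        "day_of_week": now.weekday(),  # Monday == 0
        # Copy so callers cannot mutate the shared fallback list.
        "tasks": list(tasks),
    }
```

Passing `now` explicitly keeps the function testable while the default path still reads the clock on every call.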

Prompt registry

`serving/prompts.py` keys tip-generation prompts by stable version string. Adding a new variant means adding an entry; callers do not change. Selection precedence: the `prompt_version` field in the `POST /generate` body → env `DEFAULT_PROMPT_VERSION` → `"v1"`. The TypeScript recommender drives selection via `TIP_PROMPT_VERSION` (single value or comma-separated rotation); the version actually used flows back in the response and is persisted to `tip_scores.prompt_version` so the admin reward-analytics dashboard can bucket reactions per variant.
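
The selection precedence can be sketched as a small resolver. This is an illustration under assumptions, not the code in `prompts.py`: the `PROMPTS` contents are placeholders, `resolve_prompt_version` is a hypothetical name, and raising on an unknown version is only one possible policy since the README does not say how that case is handled.

```python
import os

# Placeholder registry; real templates live in serving/prompts.py.
PROMPTS = {
    "v1": "placeholder template for v1",
    "v2-mentor": "placeholder template for v2-mentor",
}
FALLBACK_VERSION = "v1"


def resolve_prompt_version(body, env=None):
    """Apply the precedence: request body -> DEFAULT_PROMPT_VERSION -> "v1"."""
    env = os.environ if env is None else env
    version = (
        body.get("prompt_version")
        or env.get("DEFAULT_PROMPT_VERSION")
        or FALLBACK_VERSION
    )
    if version not in PROMPTS:
        # Handling of unknown versions is an assumption of this sketch.
        raise KeyError(f"unknown prompt version: {version}")
    return version
```

For example, `resolve_prompt_version({"prompt_version": "v2-mentor"}, {"DEFAULT_PROMPT_VERSION": "v1"})` picks the body's value, while an empty body and empty environment fall through to `"v1"`.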
