research: LLM fine-tuning pipeline — tip reactions as training signal #96

alvis · 2026-04-17T08:14:04Z

Goal

Use accumulated tip reaction data to fine-tune the `tip-generator` model. Tips marked `done` in the magic zone (dwell time between 15 s and 2 min) are positive examples; tips dismissed (`dismiss`) within 5 s are negative examples.

Training data construction

-- Positive: done in magic zone
SELECT context_json, tip_content, kind
FROM tip_scores
WHERE source = 'llm' AND reaction = 'done' AND dwell_ms BETWEEN 15000 AND 120000;

-- Negative: quick dismiss
SELECT context_json, tip_content, kind
FROM tip_scores
WHERE source = 'llm' AND reaction = 'dismiss' AND dwell_ms < 5000;
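
The two queries can be combined into one labeled extraction step. The following is a minimal sketch, not the project's implementation: it assumes `tip_scores` is reachable through a DB-API connection (SQLite is used here only so the example runs anywhere), and the column types, the demo rows, and the `rule` source value are invented for illustration.

```python
import sqlite3

# Queries taken from the issue; only the layout differs.
POSITIVE_SQL = """
SELECT context_json, tip_content, kind
FROM tip_scores
WHERE source = 'llm' AND reaction = 'done'
  AND dwell_ms BETWEEN 15000 AND 120000
"""

NEGATIVE_SQL = """
SELECT context_json, tip_content, kind
FROM tip_scores
WHERE source = 'llm' AND reaction = 'dismiss'
  AND dwell_ms < 5000
"""


def fetch_labeled(conn):
    """Return positives (label 1) followed by negatives (label 0) as dicts."""
    examples = []
    for sql, label in ((POSITIVE_SQL, 1), (NEGATIVE_SQL, 0)):
        for context_json, tip_content, kind in conn.execute(sql):
            examples.append({
                "context_json": context_json,
                "tip_content": tip_content,
                "kind": kind,
                "label": label,
            })
    return examples


def make_demo_db():
    """Build an in-memory table with an assumed schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tip_scores (context_json TEXT, tip_content TEXT, "
        "kind TEXT, source TEXT, reaction TEXT, dwell_ms INTEGER)"
    )
    rows = [
        ('{"app": "editor"}', "Stretch your wrists", "health", "llm", "done", 45000),  # positive
        ('{"app": "editor"}', "Drink water", "health", "llm", "dismiss", 1200),        # negative
        ('{"app": "mail"}', "Archive old mail", "focus", "llm", "done", 3000),         # done too early: excluded
        ('{"app": "mail"}', "Take a walk", "health", "rule", "done", 60000),           # not from the LLM: excluded
        ('{"app": "mail"}', "Close tabs", "focus", "llm", "dismiss", 9000),            # slow dismiss: excluded
    ]
    conn.executemany("INSERT INTO tip_scores VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for example in fetch_labeled(make_demo_db()):
        print(example["label"], example["tip_content"])
```

Note that `BETWEEN` is inclusive at both ends, so a tip completed at exactly 15000 ms or 120000 ms counts as positive.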

Pipeline (Airflow DAG `finetune_tip_generator`)

1. Extract training pairs from `tip_scores` (requires ≥500 positive examples)
2. Format as instruction-tuning dataset (Alpaca format)
3. Run LoRA fine-tuning on base model (Unsloth or Axolotl)
4. Log run to MLflow; register new model version
5. Shadow-deploy: route 5% of tips to fine-tuned model, compare reactions
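
Steps 1, 2 and 5 can be sketched in plain Python. This is an illustration under stated assumptions rather than the pipeline's actual code: the instruction wording, the `label` field, and the hash-based routing rule are all hypothetical. The issue does not say how negative examples enter an Alpaca-format dataset, so the sketch formats positives only and uses negatives just for the data gate.

```python
import hashlib

MIN_PER_CLASS = 500      # step 1 names positives; the Notes ask for ~500 of each
SHADOW_FRACTION = 0.05   # 5% of tips, from step 5


def has_enough_data(examples, minimum=MIN_PER_CLASS):
    """Gate the run until both classes reach the minimum count."""
    positives = sum(1 for ex in examples if ex["label"] == 1)
    negatives = sum(1 for ex in examples if ex["label"] == 0)
    return positives >= minimum and negatives >= minimum


def to_alpaca(example):
    """Map one positive example to an Alpaca record (instruction/input/output).

    The instruction text is an assumption made for this sketch.
    """
    return {
        "instruction": f"Write a helpful '{example['kind']}' tip for the user context below.",
        "input": example["context_json"],
        "output": example["tip_content"],
    }


def build_dataset(examples):
    """Keep positives only; negatives have no obvious slot in Alpaca format."""
    return [to_alpaca(ex) for ex in examples if ex["label"] == 1]


def use_finetuned(tip_id, fraction=SHADOW_FRACTION):
    """Deterministically send about `fraction` of tip IDs to the fine-tuned model.

    Hashing the ID keeps the assignment stable across retries and processes.
    """
    digest = hashlib.sha256(str(tip_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


if __name__ == "__main__":
    demo = [{"context_json": "{}", "tip_content": "Stand up", "kind": "health", "label": 1}]
    print(has_enough_data(demo))
    print(build_dataset(demo))
    print(sum(use_finetuned(i) for i in range(10_000)), "of 10000 routed to the fine-tuned model")
```

Hash-based routing is one simple option; a feature-flag service or a per-user split would serve the same purpose if reactions need to be compared at the user level.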

Notes

- Minimum data threshold: ~500 positive + 500 negative examples before first run
- Fine-tuned model served via Ollama (GGUF export)
- Privacy: training data must be de-identified; user IDs hashed
- Blocked by: #88 (context assembler must log `context_json` to `tip_scores`), #37 (Airflow)
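
For the privacy note, one possible way to hash user IDs is a keyed hash (HMAC-SHA256), which makes guessing IDs by brute force harder than with a plain unsalted hash. The sketch below is an assumption-laden example: the `user_id` field name inside `context_json` and the key handling are hypothetical, and hashing one field does not by itself de-identify other personal data that may sit in the context.

```python
import hashlib
import hmac
import json


def hash_user_id(user_id, secret_key):
    """Return a stable pseudonym for a user ID using HMAC-SHA256."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify_context(context_json, secret_key, id_field="user_id"):
    """Replace the (assumed) user ID field in a context JSON string with its hash."""
    context = json.loads(context_json)
    if id_field in context:
        context[id_field] = hash_user_id(str(context[id_field]), secret_key)
    return json.dumps(context, sort_keys=True)


if __name__ == "__main__":
    key = b"demo-key-do-not-use"  # placeholder; a real key would come from a secret store
    print(deidentify_context('{"user_id": "alice", "app": "editor"}', key))
```

Because the same key always maps an ID to the same pseudonym, examples from one user can still be grouped (for instance, to keep them in the same train/validation split) without exposing the raw ID.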

alvis added this to the M4 — MLOps at scale milestone 2026-04-17 08:14:04 +00:00

alvis closed this issue

2026-05-14 10:44:29 +00:00

Reference: alvis/oO#96
