research: LLM fine-tuning pipeline — tip reactions as training signal #96

Closed
opened 2026-04-17 08:14:04 +00:00 by alvis · 0 comments

Goal

Use accumulated tip reaction data to fine-tune the tip-generator model. Tips marked done in the magic zone (dwell time of 15 s to 2 min) are positive examples; tips dismissed within 5 s are negative examples.

Training data construction

-- Positive: done in magic zone (15 s to 2 min, bounds inclusive)
SELECT context_json, tip_content, kind
FROM tip_scores
WHERE source = 'llm' AND reaction = 'done' AND dwell_ms BETWEEN 15000 AND 120000;

-- Negative: quick dismiss (under 5 s)
SELECT context_json, tip_content, kind
FROM tip_scores
WHERE source = 'llm' AND reaction = 'dismiss' AND dwell_ms < 5000;
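
The two queries can be combined into one labeled extraction step. The following is a minimal sketch, not the actual pipeline code: it assumes tip_scores lives in a database reachable through Python's DB-API (SQLite is used here only so the example runs standalone), and the column types, the `extract_examples` and `demo_connection` helpers, and the sample rows are all hypothetical.

```python
import sqlite3

# Same filters as the queries above.
POSITIVE_SQL = """
SELECT context_json, tip_content, kind
FROM tip_scores
WHERE source = 'llm' AND reaction = 'done'
  AND dwell_ms BETWEEN 15000 AND 120000
"""

NEGATIVE_SQL = """
SELECT context_json, tip_content, kind
FROM tip_scores
WHERE source = 'llm' AND reaction = 'dismiss'
  AND dwell_ms < 5000
"""


def extract_examples(conn):
    """Return (context_json, tip_content, kind, label) tuples.

    label is 1 for positives (done in the magic zone) and 0 for
    negatives (quick dismiss). Positives come first in the result.
    """
    positives = [row + (1,) for row in conn.execute(POSITIVE_SQL)]
    negatives = [row + (0,) for row in conn.execute(NEGATIVE_SQL)]
    return positives + negatives


def demo_connection():
    """Build an in-memory table with made-up rows (schema is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tip_scores (context_json TEXT, tip_content TEXT, "
        "kind TEXT, source TEXT, reaction TEXT, dwell_ms INTEGER)"
    )
    rows = [
        ('{"app": "editor"}', "Stretch your legs", "health", "llm", "done", 30000),
        ("{}", "Drink water", "health", "llm", "dismiss", 2000),
        ("{}", "Close unused tabs", "focus", "llm", "done", 5000),      # done too fast
        ("{}", "Take a break", "focus", "rule", "done", 30000),         # not from the LLM
        ("{}", "Review your notes", "focus", "llm", "dismiss", 9000),   # slow dismiss
    ]
    conn.executemany("INSERT INTO tip_scores VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for example in extract_examples(demo_connection()):
        print(example)
```

Rows that match neither filter (for example a dismiss after 9 s) are left out of the dataset rather than assigned a label.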

Pipeline (Airflow DAG finetune_tip_generator)

  1. Extract training pairs from tip_scores (requires ≥500 positive and ≥500 negative examples, per the threshold in Notes)
  2. Format as instruction-tuning dataset (Alpaca format)
  3. Run LoRA fine-tuning on base model (Unsloth or Axolotl)
  4. Log run to MLflow; register new model version
  5. Shadow-deploy: route 5% of tips to fine-tuned model, compare reactions
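
Steps 1, 2, and 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the DAG's actual tasks: the function names, the instruction wording, and the use of a request ID as the routing key are all hypothetical. The Alpaca record layout (instruction, input, output) is the standard one. The issue does not say how negative examples enter training, so only positive rows are formatted here.

```python
import hashlib

MIN_POSITIVES = 500  # positive-example threshold from step 1


def has_enough_data(n_positive, minimum=MIN_POSITIVES):
    """Gate for step 1: skip the run until enough positives have accumulated."""
    return n_positive >= minimum


def to_alpaca(context_json, tip_content, kind):
    """Step 2: map one positive tip_scores row onto Alpaca's three fields.

    The instruction text is a placeholder, not the real prompt.
    """
    return {
        "instruction": f"Write a '{kind}' tip that fits the user's current context.",
        "input": context_json,
        "output": tip_content,
    }


def use_finetuned_model(request_id, fraction=0.05):
    """Step 5: send roughly `fraction` of requests to the fine-tuned model.

    Hashing the ID makes the decision deterministic, so the same
    request always lands on the same model.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


if __name__ == "__main__":
    print(to_alpaca('{"app": "editor"}', "Stretch your legs", "health"))
    routed = sum(use_finetuned_model(f"req-{i}") for i in range(10_000))
    print(f"{routed} of 10000 requests routed to the fine-tuned model")
```

Hash-based bucketing is used instead of a random draw so that routing is reproducible when reactions are later compared between the two models.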

Notes

  • Minimum data threshold: ~500 positive + 500 negative examples before first run
  • Fine-tuned model served via Ollama (GGUF export)
  • Privacy: training data must be de-identified; user IDs hashed
  • Blocked by: #88 (context assembler must log context_json to tip_scores), #37 (Airflow)
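
The privacy note can be illustrated with a keyed hash. This is a sketch of one possible approach, not a description of what the pipeline does: the issue only says user IDs are hashed, and the choice of HMAC-SHA256, the `pseudonymize_user_id` helper, and the example key are assumptions. A keyed hash is used here because a plain unsalted hash of short or sequential IDs can be reversed by enumerating candidates.

```python
import hashlib
import hmac


def pseudonymize_user_id(user_id: str, secret_key: bytes) -> str:
    """Replace a user ID with a stable HMAC-SHA256 hex digest.

    The same ID and key always give the same token, so rows from one
    user still group together without exposing the raw ID.
    """
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Hypothetical key; in practice it would come from a secret store.
    key = b"example-key-not-for-production"
    print(pseudonymize_user_id("user-42", key))
```

Hashing the ID only pseudonymizes that one field. If context_json can itself contain identifying details, it would need its own scrubbing pass before the data counts as de-identified.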
alvis added this to the M4 — MLOps at scale milestone 2026-04-17 08:14:04 +00:00
alvis closed this issue 2026-05-14 10:44:29 +00:00
Reference: alvis/oO#96