research: LLM prompt strategies for tip generation quality #84

Closed
opened 2026-04-16 15:26:11 +00:00 by alvis · 0 comments
Owner

Motivation

The quality of AI-generated tips is critical to the "feels like magic" product goal. We need to research prompt strategies that produce concise, actionable, well-timed advice.

Research areas

1. Prompt template design

  • System prompt: persona (wise mentor? efficient assistant? gentle coach?)
  • Context injection: which signals to include, how to summarize
  • Output constraints: max length, tone, actionability requirements
  • Few-shot examples: curated tip examples that define quality

2. Context window management

  • How much user history to include (last N signals? aggregated profile?)
  • Recency bias vs. pattern detection tradeoff
  • Token budget allocation (signals vs. instructions vs. examples)

3. Model selection

  • Benchmark local models: llama3.1-8b, mistral-7b, phi-3, gemma-2
  • Latency vs. quality tradeoff (8B vs 13B vs 70B)
  • Quantization impact on tip quality

4. Tip quality evaluation

  • Automated eval: LLM-as-judge (Claude) scoring on actionability, specificity, timing-awareness
  • Human eval: rate tips on 1-5 scale across dimensions
  • A/B: AI tips vs raw tasks in production (reward signal comparison)

Tasks

  • Design 3-5 prompt templates with different personas
  • Benchmark on synthetic scenarios (diverse user profiles + signal states)
  • Model comparison: latency + quality matrix
  • Build eval harness (LLM judge + human rating UI in admin)
  • Integrate winning prompt into tip generation pipeline
## Motivation The quality of AI-generated tips is critical to the "feels like magic" product goal. We need to research prompt strategies that produce concise, actionable, well-timed advice. ## Research areas ### 1. Prompt template design - System prompt: persona (wise mentor? efficient assistant? gentle coach?) - Context injection: which signals to include, how to summarize - Output constraints: max length, tone, actionability requirements - Few-shot examples: curated tip examples that define quality ### 2. Context window management - How much user history to include (last N signals? aggregated profile?) - Recency bias vs. pattern detection tradeoff - Token budget allocation (signals vs. instructions vs. examples) ### 3. Model selection - Benchmark local models: llama3.1-8b, mistral-7b, phi-3, gemma-2 - Latency vs. quality tradeoff (8B vs 13B vs 70B) - Quantization impact on tip quality ### 4. Tip quality evaluation - Automated eval: LLM-as-judge (Claude) scoring on actionability, specificity, timing-awareness - Human eval: rate tips on 1-5 scale across dimensions - A/B: AI tips vs raw tasks in production (reward signal comparison) ## Tasks - [ ] Design 3-5 prompt templates with different personas - [ ] Benchmark on synthetic scenarios (diverse user profiles + signal states) - [ ] Model comparison: latency + quality matrix - [ ] Build eval harness (LLM judge + human rating UI in admin) - [ ] Integrate winning prompt into tip generation pipeline
alvis added this to the M2 — AI tips + multi-source signals milestone 2026-04-16 15:26:11 +00:00
alvis closed this issue 2026-04-24 15:44:05 +00:00
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: alvis/oO#84