feat: M2 AI tips — LiteLLM gateway, context assembler, end-to-end generation pipeline

Issues closed: #86, #87, #88, #89, #90, #91, #79, #80, #82

infra:
- docker-compose `ai` profile: Ollama + LiteLLM services
- infra/litellm/litellm_config.yaml: tip-generator / embedder / judge aliases
- .env.example: LITELLM_URL, LITELLM_MASTER_KEY, OLLAMA_URL

ml/serving:
- POST /generate: calls LiteLLM tip-generator alias, returns TipCandidate[]
- JSON retry loop (2 retries with correction prompt on malformed response)
- _parse_llm_json strips markdown fences

ml/features:
- context.py: build_context() assembles user signals → PromptContext
  (sorts overdue/high-priority tasks first for LLM prompt quality)

shared-types:
- TipKind, TipSource, TipCandidate types
- Tip gains kind + rationale fields

services/api:
- recommender: 3-stage pipeline (assemble → score → serve)
  Stage 1: Todoist tasks + LLM candidates fetched in parallel
  Stage 2: egreedy bandit scores merged candidate pool
  Stage 3: serve + log with prompt_version, llm_model, tip_kind
- tip_scores: prompt_version, llm_model, tip_kind columns + migrations
- config: LITELLM_URL added
- integrations: surface token_status in /integrations response

tests:
- ml/serving/tests/test_generate.py: 13 tests (retry, 502/503, fence variants)
- ml/features/test_context.py: 9 tests (sorting, edge cases)
- services/api recommender.unit.test.ts: 16 pure-function tests (inferReward, dueAgeDays)
- services/api recommender.test.ts: 4 integration tests (tip_scores columns, LLM fallback)
- shared-types: TipCandidate, rationale, full TipFeedback action set

docs:
- ADR-0008: LiteLLM AI gateway decision
- overview.md: M2 pipeline description updated
- ml/README.md: serving + features roles updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-04-17 14:09:02 +00:00
parent 85367aeaa0
commit ffdf70733f
22 changed files with 1017 additions and 45 deletions

View File

@@ -0,0 +1,41 @@
# ADR-0008 — LiteLLM as AI gateway; model aliases decouple code from model names
**Status:** Accepted
**Date:** 2026-04-17
**Milestone:** M2
## Context
M2 requires LLM inference for tip generation (`ml/serving POST /generate`). We need a way to:
- Run locally during development without cloud API keys.
- Switch models (qwen2.5 → llama3.2, or cloud fallback) without touching application code.
- Share the LLM infrastructure with other local services on Agap.
## Decision
Route all LLM calls through **LiteLLM** (`http://localhost:4000` in dev, `llm.alogins.net` in prod) backed by **Ollama** for local inference.
Application code references model aliases — never bare model names:
| Alias | Default model | Used by |
|-------|--------------|---------|
| `tip-generator` | `qwen2.5:7b` | `ml/serving POST /generate` |
| `embedder` | `nomic-embed-text` | task clustering, dedup (M4) |
| `judge` | `claude-haiku-4-5` | offline simulation only |
Config is in `infra/litellm/litellm_config.yaml`. Swapping a model = one YAML change, zero code change.
`ml/serving` reads `LITELLM_URL` and `LITELLM_MASTER_KEY` from env. TypeScript services never call LLM endpoints directly — all inference flows through `ml/serving`.
## Consequences
- **Local dev:** `docker compose --profile ai up` starts Ollama + LiteLLM. First run pulls models (~4 GB for qwen2.5:7b).
- **Prod:** both are shared Agap services; set `LITELLM_URL=http://llm.alogins.net` in `.env.local`.
- **Offline sim:** `judge` alias points at `claude-haiku-4-5` (cloud) — requires `ANTHROPIC_API_KEY`; simulation is opt-in.
- **Vendor lock-in:** none at the code level. LiteLLM translates the OpenAI-compatible API to whatever backend.
- **Observability:** LiteLLM logs all requests; `tip_scores.llm_model` + `tip_scores.prompt_version` track which model + prompt generated each served tip.
## Alternatives considered
- **Call Ollama directly:** cheaper in latency, but ties code to Ollama's API format and makes cloud fallback a code change.
- **Call Anthropic directly from TS:** violates the rule that TS services never hold model names (CLAUDE.md prime directive 3).

View File

@@ -82,6 +82,8 @@ client ─► gateway ─► recommender (TS)
◄─ best TipCandidate
```
**Phase 1 (current):** candidates come from Todoist task list, no LLM. The bandit scores tasks directly.
**Phase 1 (shipped M1):** candidates come from Todoist task list, no LLM. The bandit scores tasks directly.
**Phase 2 (shipped M2):** LLM candidates are generated in parallel with Todoist fetch. Both pools are merged, scored by the bandit, and the winner served. `tip_scores` tracks `prompt_version`, `llm_model`, and `tip_kind` for every row.
Feedback: `POST /feedback → events.emit(reaction)` → online bandit update + `prompt_version` tracked for A/B analysis.