feat: M2 AI tips — LiteLLM gateway, context assembler, end-to-end generation pipeline

Issues closed: #86, #87, #88, #89, #90, #91, #79, #80, #82 infra: - docker-compose `ai` profile: Ollama + LiteLLM services - infra/litellm/litellm_config.yaml: tip-generator / embedder / judge aliases - .env.example: LITELLM_URL, LITELLM_MASTER_KEY, OLLAMA_URL ml/serving: - POST /generate: calls LiteLLM tip-generator alias, returns TipCandidate[] - JSON retry loop (2 retries with correction prompt on malformed response) - _parse_llm_json strips markdown fences ml/features: - context.py: build_context() assembles user signals → PromptContext (sorts overdue/high-priority tasks first for LLM prompt quality) shared-types: - TipKind, TipSource, TipCandidate types - Tip gains kind + rationale fields services/api: - recommender: 3-stage pipeline (assemble → score → serve) Stage 1: Todoist tasks + LLM candidates fetched in parallel Stage 2: egreedy bandit scores merged candidate pool Stage 3: serve + log with prompt_version, llm_model, tip_kind - tip_scores: prompt_version, llm_model, tip_kind columns + migrations - config: LITELLM_URL added - integrations: surface token_status in /integrations response tests: - ml/serving/tests/test_generate.py: 13 tests (retry, 502/503, fence variants) - ml/features/test_context.py: 9 tests (sorting, edge cases) - services/api recommender.unit.test.ts: 16 pure-function tests (inferReward, dueAgeDays) - services/api recommender.test.ts: 4 integration tests (tip_scores columns, LLM fallback) - shared-types: TipCandidate, rationale, full TipFeedback action set docs: - ADR-0008: LiteLLM AI gateway decision - overview.md: M2 pipeline description updated - ml/README.md: serving + features roles updated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:09:02 +00:00
parent 85367aeaa0
commit ffdf70733f
22 changed files with 1017 additions and 45 deletions
--- a/docs/adr/0008-litellm-ai-gateway.md
+++ b/docs/adr/0008-litellm-ai-gateway.md
@@ -0,0 +1,41 @@
+# ADR-0008 — LiteLLM as AI gateway; model aliases decouple code from model names
+
+**Status:** Accepted  
+**Date:** 2026-04-17  
+**Milestone:** M2
+
+## Context
+
+M2 requires LLM inference for tip generation (`ml/serving POST /generate`). We need a way to:
+- Run locally during development without cloud API keys.
+- Switch models (qwen2.5 → llama3.2, or cloud fallback) without touching application code.
+- Share the LLM infrastructure with other local services on Agap.
+
+## Decision
+
+Route all LLM calls through **LiteLLM** (`http://localhost:4000` in dev, `llm.alogins.net` in prod) backed by **Ollama** for local inference.
+
+Application code references model aliases — never bare model names:
+
+| Alias | Default model | Used by |
+|-------|--------------|---------|
+| `tip-generator` | `qwen2.5:7b` | `ml/serving POST /generate` |
+| `embedder` | `nomic-embed-text` | task clustering, dedup (M4) |
+| `judge` | `claude-haiku-4-5` | offline simulation only |
+
+Config is in `infra/litellm/litellm_config.yaml`. Swapping a model = one YAML change, zero code change.
+
+`ml/serving` reads `LITELLM_URL` and `LITELLM_MASTER_KEY` from env. TypeScript services never call LLM endpoints directly — all inference flows through `ml/serving`.
+
+## Consequences
+
+- **Local dev:** `docker compose --profile ai up` starts Ollama + LiteLLM. First run pulls models (~4 GB for qwen2.5:7b).
+- **Prod:** both are shared Agap services; set `LITELLM_URL=http://llm.alogins.net` in `.env.local`.
+- **Offline sim:** `judge` alias points at `claude-haiku-4-5` (cloud) — requires `ANTHROPIC_API_KEY`; simulation is opt-in.
+- **Vendor lock-in:** none at the code level. LiteLLM translates the OpenAI-compatible API to whatever backend.
+- **Observability:** LiteLLM logs all requests; `tip_scores.llm_model` + `tip_scores.prompt_version` track which model + prompt generated each served tip.
+
+## Alternatives considered
+
+- **Call Ollama directly:** cheaper in latency, but ties code to Ollama's API format and makes cloud fallback a code change.
+- **Call Anthropic directly from TS:** violates the rule that TS services never hold model names (CLAUDE.md prime directive 3).
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -82,6 +82,8 @@ client ─► gateway ─► recommender (TS)
                          ◄─  best TipCandidate
 ```

-**Phase 1 (current):** candidates come from Todoist task list, no LLM. The bandit scores tasks directly.
+**Phase 1 (shipped M1):** candidates come from Todoist task list, no LLM. The bandit scores tasks directly.
+
+**Phase 2 (shipped M2):** LLM candidates are generated in parallel with Todoist fetch. Both pools are merged, scored by the bandit, and the winner served. `tip_scores` tracks `prompt_version`, `llm_model`, and `tip_kind` for every row.

 Feedback: `POST /feedback → events.emit(reaction)` → online bandit update + `prompt_version` tracked for A/B analysis.