Issues closed: #86, #87, #88, #89, #90, #91, #79, #80, #82 infra: - docker-compose `ai` profile: Ollama + LiteLLM services - infra/litellm/litellm_config.yaml: tip-generator / embedder / judge aliases - .env.example: LITELLM_URL, LITELLM_MASTER_KEY, OLLAMA_URL ml/serving: - POST /generate: calls LiteLLM tip-generator alias, returns TipCandidate[] - JSON retry loop (2 retries with correction prompt on malformed response) - _parse_llm_json strips markdown fences ml/features: - context.py: build_context() assembles user signals → PromptContext (sorts overdue/high-priority tasks first for LLM prompt quality) shared-types: - TipKind, TipSource, TipCandidate types - Tip gains kind + rationale fields services/api: - recommender: 3-stage pipeline (assemble → score → serve) Stage 1: Todoist tasks + LLM candidates fetched in parallel Stage 2: egreedy bandit scores merged candidate pool Stage 3: serve + log with prompt_version, llm_model, tip_kind - tip_scores: prompt_version, llm_model, tip_kind columns + migrations - config: LITELLM_URL added - integrations: surface token_status in /integrations response tests: - ml/serving/tests/test_generate.py: 13 tests (retry, 502/503, fence variants) - ml/features/test_context.py: 9 tests (sorting, edge cases) - services/api recommender.unit.test.ts: 16 pure-function tests (inferReward, dueAgeDays) - services/api recommender.test.ts: 4 integration tests (tip_scores columns, LLM fallback) - shared-types: TipCandidate, rationale, full TipFeedback action set docs: - ADR-0008: LiteLLM AI gateway decision - overview.md: M2 pipeline description updated - ml/README.md: serving + features roles updated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2.1 KiB
2.1 KiB
ADR-0008 — LiteLLM as AI gateway; model aliases decouple code from model names
Status: Accepted
Date: 2026-04-17
Milestone: M2
Context
M2 requires LLM inference for tip generation (ml/serving POST /generate). We need a way to:
- Run locally during development without cloud API keys.
- Switch models (qwen2.5 → llama3.2, or cloud fallback) without touching application code.
- Share the LLM infrastructure with other local services on Agap.
Decision
Route all LLM calls through LiteLLM (http://localhost:4000 in dev, llm.alogins.net in prod) backed by Ollama for local inference.
Application code references model aliases — never bare model names:
| Alias | Default model | Used by |
|---|---|---|
tip-generator |
qwen2.5:7b |
ml/serving POST /generate |
embedder |
nomic-embed-text |
task clustering, dedup (M4) |
judge |
claude-haiku-4-5 |
offline simulation only |
Config is in infra/litellm/litellm_config.yaml. Swapping a model = one YAML change, zero code change.
ml/serving reads LITELLM_URL and LITELLM_MASTER_KEY from env. TypeScript services never call LLM endpoints directly — all inference flows through ml/serving.
Consequences
- Local dev:
docker compose --profile ai upstarts Ollama + LiteLLM. First run pulls models (~4 GB for qwen2.5:7b). - Prod: both are shared Agap services; set
LITELLM_URL=http://llm.alogins.netin.env.local. - Offline sim:
judgealias points atclaude-haiku-4-5(cloud) — requiresANTHROPIC_API_KEY; simulation is opt-in. - Vendor lock-in: none at the code level. LiteLLM translates the OpenAI-compatible API to whatever backend.
- Observability: LiteLLM logs all requests;
tip_scores.llm_model+tip_scores.prompt_versiontrack which model + prompt generated each served tip.
Alternatives considered
- Call Ollama directly: cheaper in latency, but ties code to Ollama's API format and makes cloud fallback a code change.
- Call Anthropic directly from TS: violates the rule that TS services never hold model names (CLAUDE.md prime directive 3).