Files
oO/docs/adr/0008-litellm-ai-gateway.md
alvis ffdf70733f feat: M2 AI tips — LiteLLM gateway, context assembler, end-to-end generation pipeline
Issues closed: #86, #87, #88, #89, #90, #91, #79, #80, #82

infra:
- docker-compose `ai` profile: Ollama + LiteLLM services
- infra/litellm/litellm_config.yaml: tip-generator / embedder / judge aliases
- .env.example: LITELLM_URL, LITELLM_MASTER_KEY, OLLAMA_URL

ml/serving:
- POST /generate: calls LiteLLM tip-generator alias, returns TipCandidate[]
- JSON retry loop (2 retries with correction prompt on malformed response)
- _parse_llm_json strips markdown fences

ml/features:
- context.py: build_context() assembles user signals → PromptContext
  (sorts overdue/high-priority tasks first for LLM prompt quality)

shared-types:
- TipKind, TipSource, TipCandidate types
- Tip gains kind + rationale fields

services/api:
- recommender: 3-stage pipeline (assemble → score → serve)
  Stage 1: Todoist tasks + LLM candidates fetched in parallel
  Stage 2: egreedy bandit scores merged candidate pool
  Stage 3: serve + log with prompt_version, llm_model, tip_kind
- tip_scores: prompt_version, llm_model, tip_kind columns + migrations
- config: LITELLM_URL added
- integrations: surface token_status in /integrations response

tests:
- ml/serving/tests/test_generate.py: 13 tests (retry, 502/503, fence variants)
- ml/features/test_context.py: 9 tests (sorting, edge cases)
- services/api recommender.unit.test.ts: 16 pure-function tests (inferReward, dueAgeDays)
- services/api recommender.test.ts: 4 integration tests (tip_scores columns, LLM fallback)
- shared-types: TipCandidate, rationale, full TipFeedback action set

docs:
- ADR-0008: LiteLLM AI gateway decision
- overview.md: M2 pipeline description updated
- ml/README.md: serving + features roles updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:09:02 +00:00

2.1 KiB

ADR-0008 — LiteLLM as AI gateway; model aliases decouple code from model names

Status: Accepted
Date: 2026-04-17
Milestone: M2

Context

M2 requires LLM inference for tip generation (ml/serving POST /generate). We need a way to:

  • Run locally during development without cloud API keys.
  • Switch models (qwen2.5 → llama3.2, or cloud fallback) without touching application code.
  • Share the LLM infrastructure with other local services on Agap.

Decision

Route all LLM calls through LiteLLM (http://localhost:4000 in dev, llm.alogins.net in prod) backed by Ollama for local inference.

Application code references model aliases — never bare model names:

Alias Default model Used by
tip-generator qwen2.5:7b ml/serving POST /generate
embedder nomic-embed-text task clustering, dedup (M4)
judge claude-haiku-4-5 offline simulation only

Config is in infra/litellm/litellm_config.yaml. Swapping a model = one YAML change, zero code change.

ml/serving reads LITELLM_URL and LITELLM_MASTER_KEY from env. TypeScript services never call LLM endpoints directly — all inference flows through ml/serving.

Consequences

  • Local dev: docker compose --profile ai up starts Ollama + LiteLLM. First run pulls models (~4 GB for qwen2.5:7b).
  • Prod: both are shared Agap services; set LITELLM_URL=http://llm.alogins.net in .env.local.
  • Offline sim: judge alias points at claude-haiku-4-5 (cloud) — requires ANTHROPIC_API_KEY; simulation is opt-in.
  • Vendor lock-in: none at the code level. LiteLLM translates the OpenAI-compatible API to whatever backend.
  • Observability: LiteLLM logs all requests; tip_scores.llm_model + tip_scores.prompt_version track which model + prompt generated each served tip.

Alternatives considered

  • Call Ollama directly: cheaper in latency, but ties code to Ollama's API format and makes cloud fallback a code change.
  • Call Anthropic directly from TS: violates the rule that TS services never hold model names (CLAUDE.md prime directive 3).