Issues closed: #86, #87, #88, #89, #90, #91, #79, #80, #82 infra: - docker-compose `ai` profile: Ollama + LiteLLM services - infra/litellm/litellm_config.yaml: tip-generator / embedder / judge aliases - .env.example: LITELLM_URL, LITELLM_MASTER_KEY, OLLAMA_URL ml/serving: - POST /generate: calls LiteLLM tip-generator alias, returns TipCandidate[] - JSON retry loop (2 retries with correction prompt on malformed response) - _parse_llm_json strips markdown fences ml/features: - context.py: build_context() assembles user signals → PromptContext (sorts overdue/high-priority tasks first for LLM prompt quality) shared-types: - TipKind, TipSource, TipCandidate types - Tip gains kind + rationale fields services/api: - recommender: 3-stage pipeline (assemble → score → serve) Stage 1: Todoist tasks + LLM candidates fetched in parallel Stage 2: egreedy bandit scores merged candidate pool Stage 3: serve + log with prompt_version, llm_model, tip_kind - tip_scores: prompt_version, llm_model, tip_kind columns + migrations - config: LITELLM_URL added - integrations: surface token_status in /integrations response tests: - ml/serving/tests/test_generate.py: 13 tests (retry, 502/503, fence variants) - ml/features/test_context.py: 9 tests (sorting, edge cases) - services/api recommender.unit.test.ts: 16 pure-function tests (inferReward, dueAgeDays) - services/api recommender.test.ts: 4 integration tests (tip_scores columns, LLM fallback) - shared-types: TipCandidate, rationale, full TipFeedback action set docs: - ADR-0008: LiteLLM AI gateway decision - overview.md: M2 pipeline description updated - ml/README.md: serving + features roles updated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
42 lines
2.1 KiB
Markdown
42 lines
2.1 KiB
Markdown
# ADR-0008 — LiteLLM as AI gateway; model aliases decouple code from model names
|
|
|
|
**Status:** Accepted
|
|
**Date:** 2026-04-17
|
|
**Milestone:** M2
|
|
|
|
## Context
|
|
|
|
M2 requires LLM inference for tip generation (`ml/serving POST /generate`). We need a way to:
|
|
- Run locally during development without cloud API keys.
|
|
- Switch models (qwen2.5 → llama3.2, or cloud fallback) without touching application code.
|
|
- Share the LLM infrastructure with other local services on Agap.
|
|
|
|
## Decision
|
|
|
|
Route all LLM calls through **LiteLLM** (`http://localhost:4000` in dev, `llm.alogins.net` in prod) backed by **Ollama** for local inference.
|
|
|
|
Application code references model aliases — never bare model names:
|
|
|
|
| Alias | Default model | Used by |
|
|
|-------|--------------|---------|
|
|
| `tip-generator` | `qwen2.5:7b` | `ml/serving POST /generate` |
|
|
| `embedder` | `nomic-embed-text` | task clustering, dedup (M4) |
|
|
| `judge` | `claude-haiku-4-5` | offline simulation only |
|
|
|
|
Config is in `infra/litellm/litellm_config.yaml`. Swapping a model = one YAML change, zero code change.
|
|
|
|
`ml/serving` reads `LITELLM_URL` and `LITELLM_MASTER_KEY` from env. TypeScript services never call LLM endpoints directly — all inference flows through `ml/serving`.
|
|
|
|
## Consequences
|
|
|
|
- **Local dev:** `docker compose --profile ai up` starts Ollama + LiteLLM. First run pulls models (~4 GB for qwen2.5:7b).
|
|
- **Prod:** both are shared Agap services; set `LITELLM_URL=http://llm.alogins.net` in `.env.local`.
|
|
- **Offline sim:** `judge` alias points at `claude-haiku-4-5` (cloud) — requires `ANTHROPIC_API_KEY`; simulation is opt-in.
|
|
- **Vendor lock-in:** none at the code level. LiteLLM translates the OpenAI-compatible API to whatever backend.
|
|
- **Observability:** LiteLLM logs all requests; `tip_scores.llm_model` + `tip_scores.prompt_version` track which model + prompt generated each served tip.
|
|
|
|
## Alternatives considered
|
|
|
|
- **Call Ollama directly:** cheaper in latency, but ties code to Ollama's API format and makes cloud fallback a code change.
|
|
- **Call Anthropic directly from TS:** violates the rule that TS services never hold model names (CLAUDE.md prime directive 3).
|