# ADR-0008 — LiteLLM as AI gateway; model aliases decouple code from model names **Status:** Accepted **Date:** 2026-04-17 **Milestone:** M2 ## Context M2 requires LLM inference for tip generation (`ml/serving POST /generate`). We need a way to: - Run locally during development without cloud API keys. - Switch models (qwen2.5 → llama3.2, or cloud fallback) without touching application code. - Share the LLM infrastructure with other local services on Agap. ## Decision Route all LLM calls through **LiteLLM** (`http://localhost:4000` in dev, `llm.alogins.net` in prod) backed by **Ollama** for local inference. Application code references model aliases — never bare model names: | Alias | Default model | Used by | |-------|--------------|---------| | `tip-generator` | `qwen2.5:7b` | `ml/serving POST /generate` | | `embedder` | `nomic-embed-text` | task clustering, dedup (M4) | | `judge` | `claude-haiku-4-5` | offline simulation only | Config is in `infra/litellm/litellm_config.yaml`. Swapping a model = one YAML change, zero code change. `ml/serving` reads `LITELLM_URL` and `LITELLM_MASTER_KEY` from env. TypeScript services never call LLM endpoints directly — all inference flows through `ml/serving`. ## Consequences - **Local dev:** `docker compose --profile ai up` starts Ollama + LiteLLM. First run pulls models (~4 GB for qwen2.5:7b). - **Prod:** both are shared Agap services; set `LITELLM_URL=http://llm.alogins.net` in `.env.local`. - **Offline sim:** `judge` alias points at `claude-haiku-4-5` (cloud) — requires `ANTHROPIC_API_KEY`; simulation is opt-in. - **Vendor lock-in:** none at the code level. LiteLLM translates the OpenAI-compatible API to whatever backend. - **Observability:** LiteLLM logs all requests; `tip_scores.llm_model` + `tip_scores.prompt_version` track which model + prompt generated each served tip. ## Alternatives considered - **Call Ollama directly:** cheaper in latency, but ties code to Ollama's API format and makes cloud fallback a code change. - **Call Anthropic directly from TS:** violates the rule that TS services never hold model names (CLAUDE.md prime directive 3).