oO/docs/adr/0008-litellm-ai-gateway.md

# ADR-0008 — LiteLLM as AI gateway; model aliases decouple code from model names

**Status:** Accepted
**Date:** 2026-04-17
**Milestone:** M2

## Context

M2 requires LLM inference for tip generation (`ml/serving POST /generate`). We need a way to:
- Run locally during development without cloud API keys.
- Switch models (qwen2.5 → llama3.2, or cloud fallback) without touching application code.
- Share the LLM infrastructure with other local services on Agap.

## Decision

Route all LLM calls through **LiteLLM** (`http://localhost:4000` in dev, `llm.alogins.net` in prod) backed by **Ollama** for local inference.

Application code references model aliases — never bare model names:

| Alias | Default model | Used by |
|-------|--------------|---------|
| `tip-generator` | `qwen2.5:7b` | `ml/serving POST /generate` |
| `embedder` | `nomic-embed-text` | task clustering, dedup (M4) |
| `judge` | `claude-haiku-4-5` | offline simulation only |

Config is in `infra/litellm/litellm_config.yaml`. Swapping a model = one YAML change, zero code change.

`ml/serving` reads `LITELLM_URL` and `LITELLM_MASTER_KEY` from env. TypeScript services never call LLM endpoints directly — all inference flows through `ml/serving`.

## Consequences

- **Local dev:** `docker compose --profile ai up` starts Ollama + LiteLLM. First run pulls models (~4 GB for qwen2.5:7b).
- **Prod:** both are shared Agap services; set `LITELLM_URL=http://llm.alogins.net` in `.env.local`.
- **Offline sim:** `judge` alias points at `claude-haiku-4-5` (cloud) — requires `ANTHROPIC_API_KEY`; simulation is opt-in.
- **Vendor lock-in:** none at the code level. LiteLLM translates the OpenAI-compatible API to whatever backend.
- **Observability:** LiteLLM logs all requests; `tip_scores.llm_model` + `tip_scores.prompt_version` track which model + prompt generated each served tip.

## Alternatives considered

- **Call Ollama directly:** cheaper in latency, but ties code to Ollama's API format and makes cloud fallback a code change.
- **Call Anthropic directly from TS:** violates the rule that TS services never hold model names (CLAUDE.md prime directive 3).