AI

LLM infrastructure for Agap and its consumers (oO, Adolf, Omo, Open WebUI, Pipecat voice).

Configuration is in agap_git/openai/ (compose + litellm config).

Stack

Service	Purpose	Port
`ollama`	GPU LLM runtime (qwen, nomic-embed-text)	`11436` → `11434`
`ollama-cpu`	CPU-only runtime for small models	`11435` → `11434`
`litellm`	OpenAI-compatible proxy, model aliases, rate-limits	`4000`
`litellm-db`	Postgres — LiteLLM key/usage persistence	internal
`langfuse`	LLM observability (traces, cost, feedback)	`3200` → `3000`
`langfuse-db`	Postgres — Langfuse	internal
`qdrant`	Vector DB	`6333`, `6334`
`searxng`	Meta-search, pluggable to agents	`11437` → `8080`
`open-webui`	Chat UI (see Open-WebUI)	`3125`
`faster-whisper`	STT (Whisper v3-turbo, CUDA)	`8880`
`silero-tts`	TTS (Silero v4)	`8881`
`pipecat`	Voice pipeline (LiveKit + STT + TTS + LLM)	`8882`

Public endpoints

llm.alogins.net → LiteLLM (localhost:4000)
lf.alogins.net → Langfuse (localhost:3200)
ai.alogins.net → Open WebUI (localhost:3125)
voice.alogins.net → Pipecat (localhost:8882)
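
The mapping above can be written down as a small lookup for a smoke-test script. This is only a sketch: the function names `upstream_for` and `probe_all` are hypothetical, and probing each upstream with a plain GET on `/` is an assumption (some services may legitimately answer with a redirect or 401 rather than 200).

```shell
# Map a public hostname to the local upstream listed above.
upstream_for() {
    case "$1" in
        llm.alogins.net)   echo "localhost:4000" ;;
        lf.alogins.net)    echo "localhost:3200" ;;
        ai.alogins.net)    echo "localhost:3125" ;;
        voice.alogins.net) echo "localhost:8882" ;;
        *) echo "unknown host: $1" >&2; return 1 ;;
    esac
}

# Print the HTTP status each local upstream returns (000 = no connection).
# Assumption: GET / is a good enough liveness signal for every service.
probe_all() {
    for host in llm.alogins.net lf.alogins.net ai.alogins.net voice.alogins.net; do
        upstream=$(upstream_for "$host") || continue
        code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://$upstream/")
        echo "$host -> $upstream: $code"
    done
}
```

Run `probe_all` on the host itself; it bypasses the public reverse proxy, so a failure here points at the container rather than at DNS or TLS.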

LiteLLM

OpenAI-compatible endpoint. All LLM traffic for Agap apps routes through it so model swaps are a config change, not a code change.

Config: agap_git/openai/litellm-config.yaml
Master key: set via the LITELLM_MASTER_KEY environment variable on the container
Langfuse callbacks for every request (success + failure)
Fallbacks defined per model (e.g. deepseek/deepseek-r1:free → ollama/qwen3.5:4b)

Model aliases

oO recommender (used by ml/serving — see oO CLAUDE.md):

Alias	Backend	Notes
`tip-generator`	`ollama/qwen2.5:1.5b` on host ollama `:11434`	Tip candidate generation
`embedder`	`ollama/nomic-embed-text` on host ollama `:11434`	Task clustering, dedup
`judge`	`anthropic/claude-haiku-4-5-20251001`	Offline eval only

Raw models (full list in litellm-config.yaml): qwen2.5 / qwen3 / gemma3 via ollama, plus OpenRouter free tier (llama, deepseek, gemma, mistral, nvidia nemotron, gpt-oss, minimax, hermes).

Quick test

curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
     -H "Content-Type: application/json" \
     https://llm.alogins.net/v1/chat/completions \
     -d '{"model":"tip-generator","messages":[{"role":"user","content":"hi"}]}'
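
The same pattern works for the other aliases. The helpers below are a sketch, not part of the repo: `chat_body`, `embed_body` and `llm_post` are hypothetical names, no JSON escaping is done (so the text must not contain double quotes or backslashes), and `/v1/embeddings` is assumed to be exposed by the proxy as the usual OpenAI-compatible route.

```shell
# Build a chat-completions body. No JSON escaping: keep quotes out of $2.
chat_body() {
    printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# Build an embeddings body, e.g. for the `embedder` alias.
embed_body() {
    printf '{"model":"%s","input":"%s"}' "$1" "$2"
}

# POST a JSON body ($2) to a path ($1) on the proxy, using the master key.
llm_post() {
    curl -s -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
         -H "Content-Type: application/json" \
         "https://llm.alogins.net$1" -d "$2"
}
```

For example, `llm_post /v1/embeddings "$(embed_body embedder 'buy milk')"` should return an embedding vector if the alias is wired up as described in the table above.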

Ollama

Two runtimes: a GPU one (main, host port 11436) and a CPU one for small models (host port 11435). Both listen on 11434 inside their containers. Models live under /mnt/ssd/ai/ollama and /mnt/ssd/ai/ollama-cpu.

GPU config (compose):

OLLAMA_MAX_LOADED_MODELS=2 — two models resident in VRAM
OLLAMA_NUM_PARALLEL=1 — serialise inference to avoid GPU contention
OLLAMA_NUM_GPU=999 — force all layers to GPU, fail rather than fall back to CPU
mem_limit: 4g (caps the container's host RAM; it does not limit VRAM)
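
To confirm these settings actually reached the running container, the environment can be filtered as below. A minimal sketch: `ollama_settings` is a hypothetical helper name, and the container is assumed to be named `ollama` as in the Pull / list commands.

```shell
# Keep only OLLAMA_* variables from `env` output on stdin, sorted by name.
ollama_settings() {
    grep '^OLLAMA_' | sort
}
```

Usage: `docker exec ollama env | ollama_settings`. To see which models are currently resident and whether they sit on GPU or CPU, `docker exec ollama ollama ps` lists them with a PROCESSOR column.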

Pull / list

docker exec ollama ollama pull qwen2.5:1.5b
docker exec ollama ollama list

Langfuse

Observability for LiteLLM. Traces every completion with prompt, response, tokens, cost, latency. Signup disabled (AUTH_DISABLE_SIGNUP=true) — create users via admin.

Backend DB: langfuse-db (Postgres)
Callbacks wired through LiteLLM (litellm_settings.success_callback: [langfuse])

Data

Path	Contents
`/mnt/ssd/ai/ollama`	GPU ollama models
`/mnt/ssd/ai/ollama-cpu`	CPU ollama models
`/mnt/ssd/ai/open-webui`	Open WebUI state
`/mnt/ssd/ai/searxng`	Searxng config + cache
`/mnt/ssd/ai/faster-whisper`	Whisper model cache
`/mnt/ssd/ai/silero-tts`	Silero torch cache
`/mnt/ssd/dbs/litellm/postgres`	LiteLLM keys + usage
`/mnt/ssd/dbs/langfuse/postgres`	Langfuse traces
`/mnt/ssd/dbs/qdrant`	Vector store

Start / stop

cd ~/agap_git/openai
docker compose up -d                       # everything
docker compose up -d litellm               # minimal (brings up its deps)
docker compose stop litellm langfuse       # stop subset

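Most services are set to restart: always, so they come back up after a reboot. A service stopped with docker compose stop stays down until it is started again or the Docker daemon restarts.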