AI
LLM infrastructure for Agap and its consumers (oO, Adolf, Omo, Open WebUI, Pipecat voice).
Configuration lives in `agap_git/openai/` (Compose file + LiteLLM config).
Stack
| Service | Purpose | Port |
|---|---|---|
| `ollama` | GPU LLM runtime (qwen, nomic-embed-text) | 11436 → 11434 |
| `ollama-cpu` | CPU-only runtime for small models | 11435 → 11434 |
| `litellm` | OpenAI-compatible proxy, model aliases, rate limits | 4000 |
| `litellm-db` | Postgres (LiteLLM key/usage persistence) | internal |
| `langfuse` | LLM observability (traces, cost, feedback) | 3200 → 3000 |
| `langfuse-db` | Postgres (Langfuse) | internal |
| `qdrant` | Vector DB | 6333, 6334 |
| `searxng` | Meta-search, pluggable to agents | 11437 → 8080 |
| `open-webui` | Chat UI (see Open-WebUI) | 3125 |
| `faster-whisper` | STT (Whisper v3-turbo, CUDA) | 8880 |
| `silero-tts` | TTS (Silero v4) | 8881 |
| `pipecat` | Voice pipeline (LiveKit + STT + TTS + LLM) | 8882 |
Public endpoints
- `llm.alogins.net` → LiteLLM (`localhost:4000`)
- `lf.alogins.net` → Langfuse (`localhost:3200`)
- `ai.alogins.net` → Open WebUI (`localhost:3125`)
- `voice.alogins.net` → Pipecat (`localhost:8882`)
LiteLLM
OpenAI-compatible endpoint. All LLM traffic for Agap apps routes through it so model swaps are a config change, not a code change.
- Config: `agap_git/openai/litellm-config.yaml`
- Master key: set as `LITELLM_MASTER_KEY` env on the container
- Langfuse callbacks for every request (success + failure)
- Fallbacks defined per model (e.g. `deepseek/deepseek-r1:free` → `ollama/qwen3.5:4b`)
Model aliases
oO recommender (used by ml/serving — see oO CLAUDE.md):
| Alias | Backend | Notes |
|---|---|---|
| `tip-generator` | `ollama/qwen2.5:1.5b` on host ollama `:11434` | Tip candidate generation |
| `embedder` | `ollama/nomic-embed-text` on host ollama `:11434` | Task clustering, dedup |
| `judge` | `anthropic/claude-haiku-4-5-20251001` | Offline eval only |
Raw models (full list in litellm-config.yaml): qwen2.5 / qwen3 / gemma3 via ollama, plus OpenRouter free tier (llama, deepseek, gemma, mistral, nvidia nemotron, gpt-oss, minimax, hermes).
Quick test
```shell
curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  https://llm.alogins.net/v1/chat/completions \
  -d '{"model":"tip-generator","messages":[{"role":"user","content":"hi"}]}'
```
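The quick test covers chat completions. As a sketch, the same authenticated call can be wrapped in a small helper to list the models the proxy currently serves, or to request embeddings through the `embedder` alias. `LLM_BASE`, `llm_curl`, and `embed_body` are illustrative names, not part of the repo; `/v1/models` and `/v1/embeddings` are the standard OpenAI-compatible paths and are assumed to be enabled on this proxy.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not from agap_git.
# Base URL taken from "Public endpoints"; override LLM_BASE for local access.
LLM_BASE="${LLM_BASE:-https://llm.alogins.net}"

# llm_curl PATH [extra curl args]: authenticated request to the LiteLLM proxy.
llm_curl() {
  path="$1"
  shift
  curl -sS \
    -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
    -H "Content-Type: application/json" \
    "$LLM_BASE$path" "$@"
}

# embed_body TEXT: JSON body for the `embedder` alias.
# No JSON escaping is done, so TEXT must not contain double quotes,
# backslashes, or newlines.
embed_body() {
  printf '{"model":"embedder","input":"%s"}' "$1"
}
```

With these defined, `llm_curl /v1/models` should return the aliases and raw models configured in `litellm-config.yaml`, and `llm_curl /v1/embeddings -d "$(embed_body 'buy milk')"` should return a vector from `nomic-embed-text`.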
Ollama
Two runtimes: a GPU one (main, :11436) and a CPU fallback (:11435). Models live under /mnt/ssd/ai/ollama and /mnt/ssd/ai/ollama-cpu.
GPU config (compose):
- `OLLAMA_MAX_LOADED_MODELS=2`: two models resident in VRAM
- `OLLAMA_NUM_PARALLEL=1`: serialise inference to avoid GPU contention
- `OLLAMA_NUM_GPU=999`: force all layers to GPU, fail rather than fall back to CPU
- `mem_limit: 4g`
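One way to check that the GPU settings have the intended effect is to look at the PROCESSOR column of `ollama ps`, which reports how each loaded model is split between CPU and GPU. The filter below is a minimal sketch: `not_fully_on_gpu` is a hypothetical helper, and it assumes the column layout of current Ollama releases (a header row, then one model per line, with `100% GPU` shown for fully offloaded models).

```shell
#!/bin/sh
# Hypothetical helper: reads `ollama ps` output on stdin and prints the
# names of loaded models that are not reported as "100% GPU".
# Assumes a header row followed by one model per line.
not_fully_on_gpu() {
  awk 'NR > 1 && NF > 0 && $0 !~ /100% GPU/ { print $1 }'
}
```

Run it as `docker exec ollama ollama ps | not_fully_on_gpu`; empty output means every resident model is fully on the GPU. `docker exec ollama env | grep '^OLLAMA_'` shows which of the variables the container actually received.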
Pull / list
```shell
docker exec ollama ollama pull qwen2.5:1.5b
docker exec ollama ollama list
```
Langfuse
Observability for LiteLLM. Traces every completion with prompt, response, tokens, cost, latency. Signup disabled (AUTH_DISABLE_SIGNUP=true) — create users via admin.
- Backend DB: `langfuse-db` (Postgres)
- Callbacks wired through LiteLLM (`litellm_settings.success_callback: [langfuse]`)
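Because the callback runs inside LiteLLM, a caller can usually label its own traces by adding a `metadata` object to the request body. The sketch below assumes the `generation_name` and `trace_user_id` fields described in LiteLLM's Langfuse logging docs; verify them against the deployed LiteLLM version. `traced_body` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: builds a chat request body carrying Langfuse metadata.
#   $1 = model alias, $2 = generation name, $3 = user id, $4 = prompt
# No JSON escaping is done, so arguments must not contain double quotes,
# backslashes, or newlines.
traced_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"metadata":{"generation_name":"%s","trace_user_id":"%s"}}' \
    "$1" "$4" "$2" "$3"
}

# Example call (same endpoint and key as the quick test):
#   curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#        -H "Content-Type: application/json" \
#        https://llm.alogins.net/v1/chat/completions \
#        -d "$(traced_body tip-generator smoke-test alice hi)"
```

If the fields are honoured, the resulting trace should show up in Langfuse under the given generation name and user, which makes smoke tests easy to tell apart from application traffic.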
Data
| Path | Contents |
|---|---|
| `/mnt/ssd/ai/ollama` | GPU ollama models |
| `/mnt/ssd/ai/ollama-cpu` | CPU ollama models |
| `/mnt/ssd/ai/open-webui` | Open WebUI state |
| `/mnt/ssd/ai/searxng` | Searxng config + cache |
| `/mnt/ssd/ai/faster-whisper` | Whisper model cache |
| `/mnt/ssd/ai/silero-tts` | Silero torch cache |
| `/mnt/ssd/dbs/litellm/postgres` | LiteLLM keys + usage |
| `/mnt/ssd/dbs/langfuse/postgres` | Langfuse traces |
| `/mnt/ssd/dbs/qdrant` | Vector store |
Start / stop
```shell
cd ~/agap_git/openai
docker compose up -d                  # everything
docker compose up -d litellm          # minimal (brings up its deps)
docker compose stop litellm langfuse  # stop subset
```
Most services are restart: always.
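After a start or restart, a quick probe of the published host ports shows which services are answering. This is a minimal sketch, not part of the repo: `service_port`, `probe`, and `probe_all` are hypothetical helpers, the ports are copied from the Stack table, and the check assumes every listed port speaks HTTP (any status code counts as "up").

```shell
#!/bin/sh
# Hypothetical helpers; host ports are copied from the Stack table.
service_port() {
  case "$1" in
    ollama)         echo 11436 ;;
    ollama-cpu)     echo 11435 ;;
    litellm)        echo 4000 ;;
    langfuse)       echo 3200 ;;
    qdrant)         echo 6333 ;;
    searxng)        echo 11437 ;;
    open-webui)     echo 3125 ;;
    faster-whisper) echo 8880 ;;
    silero-tts)     echo 8881 ;;
    pipecat)        echo 8882 ;;
    *)              return 1 ;;
  esac
}

# probe SERVICE: prints "up" if anything answers HTTP on the host port.
# curl without -f exits 0 for any HTTP response, including 404.
probe() {
  port=$(service_port "$1") || { echo "$1: unknown service"; return 2; }
  if curl -s -o /dev/null --max-time 2 "http://localhost:$port/"; then
    echo "$1 ($port): up"
  else
    echo "$1 ($port): down"
  fi
}

# probe_all: check every service that publishes a host port.
probe_all() {
  for svc in ollama ollama-cpu litellm langfuse qdrant searxng \
             open-webui faster-whisper silero-tts pipecat; do
    probe "$svc"
  done
}
```

Source the file and run `probe_all` on the host. The internal-only databases (`litellm-db`, `langfuse-db`) are left out because they publish no host port; `docker compose ps` covers those.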