AI
alvis edited this page 2026-04-20 12:24:56 +00:00

LLM infrastructure for Agap and its consumers (oO, Adolf, Omo, Open WebUI, Pipecat voice).

Configuration is in agap_git/openai/ (compose + litellm config).

Stack

| Service | Purpose | Port (host → container) |
| --- | --- | --- |
| ollama | GPU LLM runtime (qwen, nomic-embed-text) | 11436 → 11434 |
| ollama-cpu | CPU-only runtime for small models | 11435 → 11434 |
| litellm | OpenAI-compatible proxy, model aliases, rate limits | 4000 |
| litellm-db | Postgres (LiteLLM key/usage persistence) | internal |
| langfuse | LLM observability (traces, cost, feedback) | 3200 → 3000 |
| langfuse-db | Postgres (Langfuse) | internal |
| qdrant | Vector DB | 6333, 6334 |
| searxng | Meta-search, pluggable to agents | 11437 → 8080 |
| open-webui | Chat UI (see Open-WebUI) | 3125 |
| faster-whisper | STT (Whisper v3-turbo, CUDA) | 8880 |
| silero-tts | TTS (Silero v4) | 8881 |
| pipecat | Voice pipeline (LiveKit + STT + TTS + LLM) | 8882 |

Public endpoints

  • llm.alogins.net → LiteLLM (localhost:4000)
  • lf.alogins.net → Langfuse (localhost:3200)
  • ai.alogins.net → Open WebUI (localhost:3125)
  • voice.alogins.net → Pipecat (localhost:8882)

LiteLLM

OpenAI-compatible endpoint. All LLM traffic for Agap apps routes through it so model swaps are a config change, not a code change.

  • Config: agap_git/openai/litellm-config.yaml
  • Master key: set as LITELLM_MASTER_KEY env on the container
  • Langfuse callbacks for every request (success + failure)
  • Fallbacks defined per model (e.g. deepseek/deepseek-r1:free → ollama/qwen3.5:4b)
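
The per-model fallback idea can be pictured as trying each model in order until one answers. The sketch below is a conceptual illustration only, not LiteLLM's implementation: `FALLBACKS`, `complete_with_fallback`, and the `call_model` callback are hypothetical names, and the single table entry is the example pair from the list above.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical fallback table mirroring the example pair above.
FALLBACKS: Dict[str, List[str]] = {
    "deepseek/deepseek-r1:free": ["ollama/qwen3.5:4b"],
}


def complete_with_fallback(
    model: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> str:
    """Try `model` first, then each configured fallback in order."""
    last_error: Optional[Exception] = None
    for candidate in [model, *FALLBACKS.get(model, [])]:
        try:
            return call_model(candidate, prompt)
        except Exception as exc:  # a real proxy would filter error types
            last_error = exc
    raise RuntimeError(f"all models failed for {model!r}") from last_error
```

Because callers only ever name the primary model, the fallback order stays a configuration concern, which is the same reason the aliases exist.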

Model aliases

oO recommender (used by ml/serving — see oO CLAUDE.md):

| Alias | Backend | Notes |
| --- | --- | --- |
| tip-generator | ollama/qwen2.5:1.5b on host ollama :11434 | Tip candidate generation |
| embedder | ollama/nomic-embed-text on host ollama :11434 | Task clustering, dedup |
| judge | anthropic/claude-haiku-4-5-20251001 | Offline eval only |

Raw models (full list in litellm-config.yaml): qwen2.5 / qwen3 / gemma3 via ollama, plus OpenRouter free tier (llama, deepseek, gemma, mistral, nvidia nemotron, gpt-oss, minimax, hermes).

Quick test

curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
     -H "Content-Type: application/json" \
     https://llm.alogins.net/v1/chat/completions \
     -d '{"model":"tip-generator","messages":[{"role":"user","content":"hi"}]}'
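
The same request can be issued from Python. This is a minimal sketch using only the standard library; `build_chat_request` and `send` are hypothetical helper names, and the response is assumed to follow the OpenAI chat-completions shape that the proxy mirrors.

```python
import json
import urllib.request

LITELLM_URL = "https://llm.alogins.net/v1/chat/completions"


def build_chat_request(model: str, content: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST for the given alias."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": content}]}
    ).encode("utf-8")
    return urllib.request.Request(
        LITELLM_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the key exported, `send(build_chat_request("tip-generator", "hi", os.environ["LITELLM_MASTER_KEY"]))` (after `import os`) should be equivalent to the curl call above.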

Ollama

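Two runtimes: a GPU one (main, host port :11436) and a CPU fallback (host port :11435). Both listen on :11434 inside their containers, which is the port other services on the compose network use. Models live under /mnt/ssd/ai/ollama and /mnt/ssd/ai/ollama-cpu.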

GPU config (compose):

  • OLLAMA_MAX_LOADED_MODELS=2 — two models resident in VRAM
  • OLLAMA_NUM_PARALLEL=1 — serialise inference to avoid GPU contention
  • OLLAMA_NUM_GPU=999 — force all layers to GPU, fail rather than fall back to CPU
  • mem_limit: 4g

Pull / list

docker exec ollama ollama pull qwen2.5:1.5b
docker exec ollama ollama list
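
The model list is also available over Ollama's HTTP API (`GET /api/tags`), which avoids `docker exec` when scripting. A minimal sketch, assuming both runtimes are reachable on localhost at the host ports listed above; `OLLAMA_HOSTS`, `model_names`, and `list_models` are hypothetical names.

```python
import json
import urllib.request
from typing import List

# Assumed host-side addresses, taken from the ports listed on this page.
OLLAMA_HOSTS = {
    "gpu": "http://localhost:11436",
    "cpu": "http://localhost:11435",
}


def model_names(tags_response: dict) -> List[str]:
    """Extract sorted model names from an /api/tags response body."""
    return sorted(m["name"] for m in tags_response.get("models", []))


def list_models(runtime: str = "gpu", timeout: float = 5.0) -> List[str]:
    """Query one runtime and return the names of its locally stored models."""
    url = f"{OLLAMA_HOSTS[runtime]}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Calling `list_models("cpu")` targets the CPU runtime instead of the default GPU one.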

Langfuse

Observability for LiteLLM. Traces every completion with prompt, response, tokens, cost, latency. Signup disabled (AUTH_DISABLE_SIGNUP=true) — create users via admin.

  • Backend DB: langfuse-db (Postgres)
  • Callbacks wired through LiteLLM (litellm_settings.success_callback: [langfuse])

Data

| Path | Contents |
| --- | --- |
| /mnt/ssd/ai/ollama | GPU ollama models |
| /mnt/ssd/ai/ollama-cpu | CPU ollama models |
| /mnt/ssd/ai/open-webui | Open WebUI state |
| /mnt/ssd/ai/searxng | Searxng config + cache |
| /mnt/ssd/ai/faster-whisper | Whisper model cache |
| /mnt/ssd/ai/silero-tts | Silero torch cache |
| /mnt/ssd/dbs/litellm/postgres | LiteLLM keys + usage |
| /mnt/ssd/dbs/langfuse/postgres | Langfuse traces |
| /mnt/ssd/dbs/qdrant | Vector store |

Start / stop

cd ~/agap_git/openai
docker compose up -d                       # everything
docker compose up -d litellm               # minimal (brings up its deps)
docker compose stop litellm langfuse       # stop subset

Most services are set to restart: always, so they come back up on their own after a crash or a Docker daemon restart.
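
After bringing the stack up, a quick TCP probe can confirm that the published ports are accepting connections. This is a minimal sketch, not a health check of the services themselves; `SERVICES`, `port_open`, and `stack_status` are hypothetical names, and the ports are the host ports listed on this page for a subset of services.

```python
import socket
from typing import Dict

# Host ports for a few services, as listed on this page (illustrative subset).
SERVICES: Dict[str, int] = {
    "litellm": 4000,
    "langfuse": 3200,
    "open-webui": 3125,
    "qdrant": 6333,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(host: str = "localhost", services: Dict[str, int] = SERVICES) -> Dict[str, bool]:
    """Probe every service port and report which ones accept connections."""
    return {name: port_open(host, port) for name, port in services.items()}
```

An open port only shows that the container is listening; it says nothing about whether the application behind it is healthy.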