Add AI page; document LiteLLM + Langfuse + Ollama stack; cross-link Open-WebUI

2026-04-20 12:24:56 +00:00
parent 2a5865d98a
commit 0479842e17
3 changed files with 119 additions and 22 deletions
--- a/AI.md
+++ b/AI.md
@@ -0,0 +1,108 @@
 # AI
 LLM infrastructure for Agap and its consumers (oO, Adolf, Omo, Open WebUI, Pipecat voice).
 Configuration is in `agap_git/openai/` (compose + litellm config).
 ## Stack
 | Service | Purpose | Port |
 |---------|---------|------|
 | `ollama` | GPU LLM runtime (qwen, nomic-embed-text) | `11436` → `11434` |
 | `ollama-cpu` | CPU-only runtime for small models | `11435` → `11434` |
 | `litellm` | OpenAI-compatible proxy, model aliases, rate-limits | `4000` |
 | `litellm-db` | Postgres — LiteLLM key/usage persistence | internal |
 | `langfuse` | LLM observability (traces, cost, feedback) | `3200` → `3000` |
 | `langfuse-db` | Postgres — Langfuse | internal |
 | `qdrant` | Vector DB | `6333`, `6334` |
 | `searxng` | Meta-search, pluggable into agents | `11437` → `8080` |
 | `open-webui` | Chat UI (see [[Open-WebUI]]) | `3125` |
 | `faster-whisper` | STT (Whisper v3-turbo, CUDA) | `8880` |
 | `silero-tts` | TTS (Silero v4) | `8881` |
 | `pipecat` | Voice pipeline (LiveKit + STT + TTS + LLM) | `8882` |
 ## Public endpoints
 - `llm.alogins.net` → LiteLLM (`localhost:4000`)
 - `lf.alogins.net` → Langfuse (`localhost:3200`)
 - `ai.alogins.net` → Open WebUI (`localhost:3125`)
 - `voice.alogins.net` → Pipecat (`localhost:8882`)
 ## LiteLLM
 OpenAI-compatible endpoint. All LLM traffic for Agap apps routes through it so model swaps are a config change, not a code change.
 - Config: `agap_git/openai/litellm-config.yaml`
 - Master key: set as `LITELLM_MASTER_KEY` env on the container
 - Langfuse callbacks for every request (success + failure)
 - Fallbacks defined per model (e.g. `deepseek/deepseek-r1:free` → `ollama/qwen3.5:4b`)
 ### Model aliases
 **oO recommender** (used by `ml/serving` — see oO `CLAUDE.md`):
 | Alias | Backend | Notes |
 |-------|---------|-------|
 | `tip-generator` | `ollama/qwen2.5:1.5b` on host ollama `:11434` | Tip candidate generation |
 | `embedder` | `ollama/nomic-embed-text` on host ollama `:11434` | Task clustering, dedup |
 | `judge` | `anthropic/claude-haiku-4-5-20251001` | Offline eval only |
 **Raw models** (full list in `litellm-config.yaml`): qwen2.5 / qwen3 / gemma3 via ollama, plus OpenRouter free tier (llama, deepseek, gemma, mistral, nvidia nemotron, gpt-oss, minimax, hermes).
 ### Quick test
 ```bash
 curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
     -H "Content-Type: application/json" \
     https://llm.alogins.net/v1/chat/completions \
     -d '{"model":"tip-generator","messages":[{"role":"user","content":"hi"}]}'
 ```
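 The same kind of check can be run against the `embedder` alias. A minimal sketch, assuming LiteLLM serves that alias on the OpenAI-compatible `/v1/embeddings` route; `LITELLM_URL`, `embed_payload` and `embed_test` are hypothetical names introduced here, and the payload helper does not escape quotes in its input:
 ```bash
 # Base URL of the proxy (assumption: override LITELLM_URL for local testing).
 LITELLM_URL="${LITELLM_URL:-https://llm.alogins.net}"
 # Build an OpenAI-style embeddings request body: embed_payload MODEL TEXT
 embed_payload() {
   printf '{"model":"%s","input":"%s"}' "$1" "$2"
 }
 # Send TEXT to the embedder alias and print the raw JSON response.
 embed_test() {
   curl -sS -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
        -H "Content-Type: application/json" \
        "$LITELLM_URL/v1/embeddings" \
        -d "$(embed_payload embedder "$1")"
 }
 ```
 Calling `embed_test "hello"` should return a JSON document with an embedding vector if the alias is routed correctly.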
 ## Ollama
 Two runtimes: a GPU one (main, `:11436`) and a CPU fallback (`:11435`). Models live under `/mnt/ssd/ai/ollama` and `/mnt/ssd/ai/ollama-cpu`.
 GPU config (compose):
 - `OLLAMA_MAX_LOADED_MODELS=2` — at most two models loaded in VRAM at once
 - `OLLAMA_NUM_PARALLEL=1` — serialise inference to avoid GPU contention
 - `OLLAMA_NUM_GPU=999` — force all layers to GPU, fail rather than fall back to CPU
 - `mem_limit: 4g`
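 To confirm that a running container picked up these values, the environment can be read back. A minimal sketch, assuming the GPU container is named `ollama` and that `env` is available inside the image; `ollama_env` is a hypothetical helper name:
 ```bash
 # Print the OLLAMA_* variables set inside a container (default: ollama).
 ollama_env() {
   docker exec "${1:-ollama}" env | grep '^OLLAMA_' | sort
 }
 ```
 Run `ollama_env` for the GPU runtime; `ollama_env ollama-cpu` should work for the CPU one if its container uses the service name.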
 ### Pull / list
 ```bash
 docker exec ollama ollama pull qwen2.5:1.5b
 docker exec ollama ollama list
 ```
 ## Langfuse
 Observability for LiteLLM. Traces every completion with prompt, response, tokens, cost, latency. Signup disabled (`AUTH_DISABLE_SIGNUP=true`) — create users via admin.
 - Backend DB: `langfuse-db` (Postgres)
 - Callbacks wired through LiteLLM (`litellm_settings.success_callback: [langfuse]`)
 ## Data
 | Path | Contents |
 |------|----------|
 | `/mnt/ssd/ai/ollama` | GPU ollama models |
 | `/mnt/ssd/ai/ollama-cpu` | CPU ollama models |
 | `/mnt/ssd/ai/open-webui` | Open WebUI state |
 | `/mnt/ssd/ai/searxng` | SearXNG config + cache |
 | `/mnt/ssd/ai/faster-whisper` | Whisper model cache |
 | `/mnt/ssd/ai/silero-tts` | Silero torch cache |
 | `/mnt/ssd/dbs/litellm/postgres` | LiteLLM keys + usage |
 | `/mnt/ssd/dbs/langfuse/postgres` | Langfuse traces |
 | `/mnt/ssd/dbs/qdrant` | Vector store |
 ## Start / stop
 ```bash
 cd ~/agap_git/openai
 docker compose up -d                       # everything
 docker compose up -d litellm               # minimal (brings up its deps)
 docker compose stop litellm langfuse       # stop subset
 ```
 Most services are `restart: always`.
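 After a start or restart, probing the published ports shows which services answer. A minimal sketch, assuming the health routes below (`/health/liveliness` for LiteLLM, `/api/public/health` for Langfuse, `/` for Ollama, `/healthz` for Qdrant) exist in the deployed versions; `check` and `check_stack` are hypothetical helper names:
 ```bash
 # check NAME URL: print "ok" if the URL answers with a 2xx status within 5 s.
 check() {
   if curl -fsS -o /dev/null --max-time 5 "$2"; then
     echo "ok   $1"
   else
     echo "FAIL $1"
   fi
 }
 # Probe the main services on the host ports listed in the Stack table.
 check_stack() {
   check litellm    http://localhost:4000/health/liveliness
   check langfuse   http://localhost:3200/api/public/health
   check ollama     http://localhost:11436/
   check ollama-cpu http://localhost:11435/
   check qdrant     http://localhost:6333/healthz
 }
 ```
 Run `check_stack` on the host; any `FAIL` line points at the service to inspect with `docker compose logs`.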
--- a/Home.md
+++ b/Home.md
@@ -11,7 +11,8 @@
 - [[Immich]] — Photo management and backup
 - [[Gitea]] — Git hosting
-- [[Open-WebUI]] — AI chat interface
+- [[AI]] — LLM stack (LiteLLM, Langfuse, Ollama, STT/TTS)
+- [[Open-WebUI]] — AI chat interface (part of the AI stack)
 - [[Home-Assistant]] — KVM virtual machine
 - [[3X-UI]] — VPN proxy
 - [[Zabbix]] — Monitoring (Zabbix 7.4, PostgreSQL, Apache)
--- a/Open-WebUI.md
+++ b/Open-WebUI.md
@@ -1,35 +1,23 @@
 # Open WebUI
-AI chat interface with Ollama, GPU-accelerated via NVIDIA.
+Chat UI frontend to the [[AI]] stack. GPU-accelerated via NVIDIA Container Toolkit. Routes LLM calls through LiteLLM (`host.docker.internal:4000/v1`), STT via Faster-Whisper, TTS via Silero.
-## Configuration
-Configuration is in the `agap_git` repository:
-- Compose file: `openai/docker-compose.yml`
-Repository: https://github.com/alvis/agap_git
 ## Access
-- Web: `http://agap:3125`
+- `http://agap:3125`
 - `ai.alogins.net` (via Caddy)
+## Configuration
+`agap_git/openai/docker-compose.yml` (service `open-webui`).
 ## Data
-| Path | Contents |
+- `/mnt/ssd/ai/open-webui` — user DB, chats, settings
-|------|----------|
-| Docker volume `ollama` | Ollama models |
-| Docker volume `open-webui` | Web UI data and settings |
-## GPU
+## GPU prerequisites
 Requires NVIDIA GPU support:
 ```bash
 sudo ./nvidia-docker-install.sh
 ./install-cuda.sh
 ```
 ## Stack
 - Open WebUI (NVIDIA/Ollama variant)
 - Ollama (LLM runtime)
 - NVIDIA Container Toolkit for GPU acceleration