Add AI page; document LiteLLM + Langfuse + Ollama stack; cross-link Open-WebUI

2026-04-20 12:24:56 +00:00
parent 2a5865d98a
commit 0479842e17
3 changed files with 119 additions and 22 deletions
--- a/AI.md
+++ b/AI.md
@@ -0,0 +1,108 @@
+# AI
+
+LLM infrastructure for Agap and its consumers (oO, Adolf, Omo, Open WebUI, Pipecat voice).
+
+Configuration is in `agap_git/openai/` (compose + litellm config).
+
+## Stack
+
+| Service | Purpose | Port |
+|---------|---------|------|
+| `ollama` | GPU LLM runtime (qwen, nomic-embed-text) | `11436` → `11434` |
+| `ollama-cpu` | CPU-only runtime for small models | `11435` → `11434` |
+| `litellm` | OpenAI-compatible proxy, model aliases, rate-limits | `4000` |
+| `litellm-db` | Postgres — LiteLLM key/usage persistence | internal |
+| `langfuse` | LLM observability (traces, cost, feedback) | `3200` → `3000` |
+| `langfuse-db` | Postgres — Langfuse | internal |
+| `qdrant` | Vector DB | `6333`, `6334` |
+| `searxng` | Meta-search, pluggable to agents | `11437` → `8080` |
+| `open-webui` | Chat UI (see [[Open-WebUI]]) | `3125` |
+| `faster-whisper` | STT (Whisper v3-turbo, CUDA) | `8880` |
+| `silero-tts` | TTS (Silero v4) | `8881` |
+| `pipecat` | Voice pipeline (LiveKit + STT + TTS + LLM) | `8882` |
+
+## Public endpoints
+
+- `llm.alogins.net` → LiteLLM (`localhost:4000`)
+- `lf.alogins.net` → Langfuse (`localhost:3200`)
+- `ai.alogins.net` → Open WebUI (`localhost:3125`)
+- `voice.alogins.net` → Pipecat (`localhost:8882`)
+
+## LiteLLM
+
+OpenAI-compatible endpoint. All LLM traffic for Agap apps routes through it so model swaps are a config change, not a code change.
+
+- Config: `agap_git/openai/litellm-config.yaml`
+- Master key: set via the `LITELLM_MASTER_KEY` environment variable on the container
+- Langfuse callbacks fire on every request (success and failure)
+- Fallbacks are defined per model (e.g. `deepseek/deepseek-r1:free` → `ollama/qwen3.5:4b`)
+
+### Model aliases
+
+**oO recommender** (used by `ml/serving` — see oO `CLAUDE.md`):
+
+| Alias | Backend | Notes |
+|-------|---------|-------|
+| `tip-generator` | `ollama/qwen2.5:1.5b` on host ollama `:11434` | Tip candidate generation |
+| `embedder` | `ollama/nomic-embed-text` on host ollama `:11434` | Task clustering, dedup |
+| `judge` | `anthropic/claude-haiku-4-5-20251001` | Offline eval only |
+
+**Raw models** (full list in `litellm-config.yaml`): qwen2.5 / qwen3 / gemma3 via ollama, plus OpenRouter free tier (llama, deepseek, gemma, mistral, nvidia nemotron, gpt-oss, minimax, hermes).
+
+### Quick test
+
+```bash
+curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
+     -H "Content-Type: application/json" \
+     https://llm.alogins.net/v1/chat/completions \
+     -d '{"model":"tip-generator","messages":[{"role":"user","content":"hi"}]}'
+```
+
+## Ollama
+
+Two runtimes: a GPU one (main, `:11436`) and a CPU fallback (`:11435`). Models live under `/mnt/ssd/ai/ollama` and `/mnt/ssd/ai/ollama-cpu`.
+
+GPU config (compose):
+- `OLLAMA_MAX_LOADED_MODELS=2` — two models resident in VRAM
+- `OLLAMA_NUM_PARALLEL=1` — serialise inference to avoid GPU contention
+- `OLLAMA_NUM_GPU=999` — force all layers to GPU, fail rather than fall back to CPU
+- `mem_limit: 4g`
+
+### Pull / list
+
+```bash
+docker exec ollama ollama pull qwen2.5:1.5b
+docker exec ollama ollama list
+```
+
+## Langfuse
+
+Observability for LiteLLM. Traces every completion with prompt, response, tokens, cost, and latency. Signup is disabled (`AUTH_DISABLE_SIGNUP=true`), so users must be created by an admin.
+
+- Backend DB: `langfuse-db` (Postgres)
+- Callbacks wired through LiteLLM (`litellm_settings.success_callback: [langfuse]`)
+
+## Data
+
+| Path | Contents |
+|------|----------|
+| `/mnt/ssd/ai/ollama` | GPU ollama models |
+| `/mnt/ssd/ai/ollama-cpu` | CPU ollama models |
+| `/mnt/ssd/ai/open-webui` | Open WebUI state |
+| `/mnt/ssd/ai/searxng` | Searxng config + cache |
+| `/mnt/ssd/ai/faster-whisper` | Whisper model cache |
+| `/mnt/ssd/ai/silero-tts` | Silero torch cache |
+| `/mnt/ssd/dbs/litellm/postgres` | LiteLLM keys + usage |
+| `/mnt/ssd/dbs/langfuse/postgres` | Langfuse traces |
+| `/mnt/ssd/dbs/qdrant` | Vector store |
+
+## Start / stop
+
+```bash
+cd ~/agap_git/openai
+docker compose up -d                       # everything
+docker compose up -d litellm               # minimal (brings up its deps)
+docker compose stop litellm langfuse       # stop subset
+```
+
+Most services use `restart: always`.
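
A possible companion to the Quick test in `AI.md`, which exercises only the `tip-generator` alias. The sketch below also covers `embedder` through the OpenAI-style `/v1/embeddings` route. The function names and the `LITELLM_URL` variable are hypothetical, not part of the commit; it assumes `LITELLM_MASTER_KEY` is exported and that prompts contain no characters needing JSON escaping.

```bash
#!/usr/bin/env bash
# Hypothetical smoke-test helpers for the LiteLLM aliases documented in AI.md.
# Assumptions: LITELLM_MASTER_KEY is exported; prompts contain no quotes or
# backslashes (no JSON escaping is performed here).

LITELLM_URL="${LITELLM_URL:-https://llm.alogins.net}"   # public endpoint from AI.md

# Build an OpenAI-style chat body: chat_payload <model> <prompt>
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# Build an OpenAI-style embeddings body: embed_payload <model> <text>
embed_payload() {
  printf '{"model":"%s","input":"%s"}' "$1" "$2"
}

# POST a JSON body to a path on the proxy: litellm_post <path> <json>
# --fail turns HTTP error statuses into a non-zero exit code.
litellm_post() {
  curl --fail --silent --show-error \
       -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
       -H "Content-Type: application/json" \
       "$LITELLM_URL$1" -d "$2"
}

# Exercise the two local oO aliases. `judge` is left out because AI.md
# marks it as offline-eval only.
smoke_test() {
  litellm_post /v1/chat/completions "$(chat_payload tip-generator hi)" &&
  litellm_post /v1/embeddings "$(embed_payload embedder hi)"
}
```

Sourcing the file defines the helpers without making any requests; running `smoke_test` afterwards returns non-zero if either call fails.
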
--- a/Home.md
+++ b/Home.md
@@ -11,7 +11,8 @@

 - [[Immich]] — Photo management and backup
 - [[Gitea]] — Git hosting
-- [[Open-WebUI]] — AI chat interface
+- [[AI]] — LLM stack (LiteLLM, Langfuse, Ollama, STT/TTS)
+- [[Open-WebUI]] — AI chat interface (part of the AI stack)
 - [[Home-Assistant]] — KVM virtual machine
 - [[3X-UI]] — VPN proxy
 - [[Zabbix]] — Monitoring (Zabbix 7.4, PostgreSQL, Apache)
--- a/Open-WebUI.md
+++ b/Open-WebUI.md
@@ -1,35 +1,23 @@
 # Open WebUI

-AI chat interface with Ollama, GPU-accelerated via NVIDIA.
-
-## Configuration
-
-Configuration is in the `agap_git` repository:
-- Compose file: `openai/docker-compose.yml`
-
-Repository: https://github.com/alvis/agap_git
+Chat UI frontend for the [[AI]] stack, GPU-accelerated via the NVIDIA Container Toolkit. LLM calls route through LiteLLM (`host.docker.internal:4000/v1`), STT through Faster-Whisper, and TTS through Silero.

 ## Access

-- Web: `http://agap:3125`
+- `http://agap:3125`
+- `ai.alogins.net` (via Caddy)
+
+## Configuration
+
+`agap_git/openai/docker-compose.yml` (service `open-webui`).

 ## Data

-| Path | Contents |
-|------|----------|
-| Docker volume `ollama` | Ollama models |
-| Docker volume `open-webui` | Web UI data and settings |
+- `/mnt/ssd/ai/open-webui` — user DB, chats, settings

-## GPU
+## GPU prerequisites

-Requires NVIDIA GPU support:
 ```bash
 sudo ./nvidia-docker-install.sh
 ./install-cuda.sh
 ```
-
-## Stack
-
-- Open WebUI (NVIDIA/Ollama variant)
-- Ollama (LLM runtime)
-- NVIDIA Container Toolkit for GPU acceleration